diff --git a/.github/issue_template.md b/.github/issue_template.md old mode 100644 new mode 100755 diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md old mode 100644 new mode 100755 diff --git a/.github/workflows/black.yml b/.github/workflows/black.yml old mode 100644 new mode 100755 diff --git a/.gitignore b/.gitignore old mode 100644 new mode 100755 index a2e6a0ba..2557ab1b --- a/.gitignore +++ b/.gitignore @@ -29,3 +29,11 @@ ckpt pretrained/ LLaVA/ *logs +temp/ +InternVL/ +logs/ +data/ +llava-video/ +Video-MME/ +VATEX/ +lmms_eval/tasks/vatex/__pycache__/utils.cpython-310.pyc diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml old mode 100644 new mode 100755 diff --git a/README.md b/README.md old mode 100644 new mode 100755 index 04b62aef..72b15fb1 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -

+

@@ -6,79 +6,31 @@ > Accelerating the development of large multimodal models (LMMs) with `lmms-eval` -🏠 [Homepage](https://lmms-lab.github.io/) | 🎉 [Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab) | Discord_Thread [discord/lmms-eval](https://discord.gg/zdkwKUqrPy) - - -In today's world, we're on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks. - -To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI. However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we're on a treasure hunt, but the maps are scattered everywhere. - -In the field of language models, there has been a valuable precedent set by the work of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). They offer integrated data and model interfaces, enabling rapid evaluation of language models and serving as the backend support framework for the [open-llm-leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), and has gradually become the underlying ecosystem of the era of foundation models. - -However, though there are many new evaluation datasets are recently proposed, the efficient evaluation pipeline of LMM is still in its infancy, and there is no unified evaluation framework that can be used to evaluate LMM across a wide range of datasets. To address this challenge, we introduce **lmms-eval**, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMM. - -We humbly obsorbed the exquisite and efficient design of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). Building upon its foundation, we implemented our `lmms-eval` framework with performance optimizations specifically for LMMs. - -## Necessity of lmms-eval +🏠 [LMMs-Lab Homepage](https://lmms-lab.github.io/) | 🎉 [Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab) | Discord_Thread [discord/lmms-eval](https://discord.gg/zdkwKUqrPy) -We believe our effort could provide an efficient interface for the detailed comparison of publicly available models to discern their strengths and weaknesses. It's also useful for research institutions and production-oriented companies to accelerate the development of large multimodal models. With the `lmms-eval`, we have significantly accelerated the lifecycle of model iteration. Inside the LLaVA team, the utilization of `lmms-eval` largely improves the efficiency of the model development cycle, as we are able to evaluate weekly trained hundreds of checkpoints on 20-30 datasets, identifying the strengths and weaknesses, and then make targeted improvements. 
+---

# Announcement

-## Contribution Guidance

+- [2024-06] 🎬🎬 `lmms-eval/v0.2` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details.

-We've added guidance on contributing new datasets and models. Please refer to our [documentation](docs/README.md). If you need assistance, you can contact us via [discord/lmms-eval](https://discord.gg/ebAMGSsS).

+- [2024-03] 📝📝 We have released the first version of `lmms-eval`. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details.

-## v0.1.0 Released

+# Why `lmms-eval`?

-The first version of the `lmms-eval` is released. We are working on providing an one-command evaluation suite for accelerating the development of LMMs.
-
-> In [LLaVA Next](https://llava-vl.github.io/blog/2024-01-30-llava-next/) development, we internally utilize this suite to evaluate the multiple different model versions on various datasets. It significantly accelerates the model development cycle for it's easy integration and fast evaluation speed.
-
-The main feature includes:
-
-

- +

+

-### One-command evaluation, with detailed logs and samples. -You can evaluate the models on multiple datasets with a single command. No model/data preparation is needed, just one command line, few minutes, and get the results. Not just a result number, but also the detailed logs and samples, including the model args, input question, model response, and ground truth answer. - -```python -# Evaluating LLaVA on multiple datasets -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ # -``` - -### Accelerator support and Tasks grouping. -We support the usage of `accelerate` to wrap the model for distributed evaluation, supporting multi-gpu and tensor parallelism. With **Task Grouping**, all instances from all tasks are grouped and evaluated in parallel, which significantly improves the throughput of the evaluation. After evaluation, all instances are sent to postprocessing module for metric calcuations and potential GPT4-eval queries. - -Below are the total runtime on different datasets using 4 x A100 40G. - -| Dataset (#num) | LLaVA-v1.5-7b | LLaVA-v1.5-13b | -| :---------------------- | :----------------- | :----------------- | -| mme (2374) | 2 mins 43 seconds | 3 mins 27 seconds | -| gqa (12578) | 10 mins 43 seconds | 14 mins 23 seconds | -| scienceqa_img (2017) | 1 mins 58 seconds | 2 mins 52 seconds | -| ai2d (3088) | 3 mins 17 seconds | 4 mins 12 seconds | -| coco2017_cap_val (5000) | 14 mins 13 seconds | 19 mins 58 seconds | - -### All-In-One HF dataset hubs. - -We are hosting more than 40 (and increasing) datasets on [huggingface/lmms-lab](https://huggingface.co/lmms-lab), we carefully converted these datasets from original sources and included all variants, versions and splits. Now they can be directly accessed without any burden of data preprocessing. They also serve for the purpose of visualizing the data and grasping the sense of evaluation tasks distribution. - -

- -

+In today's world, we're on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks. -### Detailed Logging Utilites +To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI. -We provide detailed logging utilities to help you understand the evaluation process and results. The logs include the model args, generation parameters, input question, model response, and ground truth answer. You can also record every details and visualize them inside runs on Weights & Biases. +However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we're on a treasure hunt, but the maps are scattered everywhere. -{% include figure.liquid loading="eager" path="assets/img/wandb_table.png" class="img-fluid rounded z-depth-1" zoomable=true %} +In the field of language models, there has been a valuable precedent set by the work of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). They offer integrated data and model interfaces, enabling rapid evaluation of language models and serving as the backend support framework for the [open-llm-leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), and has gradually become the underlying ecosystem of the era of foundation models. -

- -

+We humbly absorbed the exquisite and efficient design of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and introduce **lmms-eval**, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.

# Installation

@@ -95,37 +47,35 @@ pip install -e .
```

If you want to test LLaVA, you will have to clone their repo from [LLaVA](https://github.com/haotian-liu/LLaVA) and
-```
-git clone https://github.com/haotian-liu/LLaVA
-cd LLaVA
+```bash
+# for llava 1.5
+# git clone https://github.com/haotian-liu/LLaVA
+# cd LLaVA
+# pip install -e .
+
+# for llava-next (1.6)
+git clone https://github.com/LLaVA-VL/LLaVA-NeXT
+cd LLaVA-NeXT
pip install -e .
```

+
+Reproduction of LLaVA-1.5's paper results + You can check the [environment install script](miscs/repr_scripts.sh) and [torch environment info](miscs/repr_torch_envs.txt) to **reproduce LLaVA-1.5's paper results**. We found torch/cuda versions difference would cause small variations in the results, we provide the [results check](miscs/llava_result_check.md) with different environments. +
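Because small score deviations usually come down to the local torch/CUDA stack, it can help to record your own versions and compare them against the reference environment in `miscs/repr_torch_envs.txt` before digging deeper. A minimal, illustrative check (not part of the repository):

```python
# Illustrative environment check: compare the printed versions with
# miscs/repr_torch_envs.txt when reproduced scores differ slightly.
import torch

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
print("gpu:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu only")
```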
+ If you want to test on caption dataset such as `coco`, `refcoco`, and `nocaps`, you will need to have `java==1.8.0 ` to let pycocoeval api to work. If you don't have it, you can install by using conda ``` conda install openjdk=8 ``` you can then check your java version by `java -version` -# Usage -```bash -# Evaluating LLaVA on MME -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme --output_path ./logs/ - -# Evaluating LLaVA on multiple datasets -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ # - -# For other variants llava. Note that `conv_template` is an arg of the init function of llava in `lmms_eval/models/llava.py` -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.6-mistral-7b,conv_template=mistral_instruct" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ # -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.6-34b,conv_template=mistral_direct" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ # -# From a predefined configuration, supporting evaluation of multiple models and datasets -accelerate launch --num_processes=8 -m lmms_eval --config example_eval.yaml -``` - -# Model Results +
+Comprehensive Evaluation Results of LLaVA Family Models +
As demonstrated by the extensive table below, we aim to provide detailed information for readers to understand the datasets included in lmms-eval and some specific details about these datasets (we remain grateful for any corrections readers may have during our evaluation process). @@ -137,162 +87,117 @@ We provide a Google Sheet for the detailed results of the LLaVA series models on We also provide the raw data exported from Weights & Biases for the detailed results of the LLaVA series models on different datasets. You can access the raw data [here](https://docs.google.com/spreadsheets/d/1AvaEmuG4csSmXaHjgu4ei1KBMmNNW8wflOD_kkTDdv8/edit?usp=sharing). -> Development will be continuing on the main branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub. +
+
+
+

Our development will continue on the main branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or to ask questions, either in issues or PRs on GitHub.
+
+# Multiple Usages
+**Evaluation of LLaVA on MME**
+
+```bash
+python3 -m accelerate.commands.launch \
+    --num_processes=8 \
+    -m lmms_eval \
+    --model llava \
+    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
+    --tasks mme \
+    --batch_size 1 \
+    --log_samples \
+    --log_samples_suffix llava_v1.5_mme \
+    --output_path ./logs/
+```
+
+**Evaluation of LLaVA on multiple datasets**
+
+```bash
+python3 -m accelerate.commands.launch \
+    --num_processes=8 \
+    -m lmms_eval \
+    --model llava \
+    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
+    --tasks mme,mmbench_en \
+    --batch_size 1 \
+    --log_samples \
+    --log_samples_suffix llava_v1.5_mme_mmbenchen \
+    --output_path ./logs/
+```
+
+**For other LLaVA variants, please change the `conv_template` in the `model_args`**
+
+> `conv_template` is an arg of the init function of llava in `lmms_eval/models/llava.py`; you can find the corresponding value in LLaVA's code, likely in the `conv_templates` dict in `llava/conversations.py`.
+
+```bash
+python3 -m accelerate.commands.launch \
+    --num_processes=8 \
+    -m lmms_eval \
+    --model llava \
+    --model_args pretrained="liuhaotian/llava-v1.6-mistral-7b,conv_template=mistral_instruct" \
+    --tasks mme,mmbench_en \
+    --batch_size 1 \
+    --log_samples \
+    --log_samples_suffix llava_v1.5_mme_mmbenchen \
+    --output_path ./logs/
+```
+
+**Evaluation of larger LMMs (llava-v1.6-34b)**
+
+```bash
+python3 -m accelerate.commands.launch \
+    --num_processes=8 \
+    -m lmms_eval \
+    --model llava \
+    --model_args pretrained="liuhaotian/llava-v1.6-34b,conv_template=mistral_direct" \
+    --tasks mme,mmbench_en \
+    --batch_size 1 \
+    --log_samples \
+    --log_samples_suffix llava_v1.5_mme_mmbenchen \
+    --output_path ./logs/
+```
+
+**Evaluation with a set of configurations, supporting evaluation of multiple models and datasets**
+
+```bash
+python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --config ./miscs/example_eval.yaml
+```
+
+**Evaluation with naive model sharding for bigger models (llava-next-72b)**
+
+```bash
+python3 -m lmms_eval \
+    --model=llava \
+    --model_args=pretrained=lmms-lab/llava-next-72b,conv_template=qwen_1_5,device_map=auto,model_name=llava_qwen \
+    --tasks=pope,vizwiz_vqa_val,scienceqa_img \
+    --batch_size=1 \
+    --log_samples \
+    --log_samples_suffix=llava_qwen \
+    --output_path="./logs/" \
+    --wandb_args=project=lmms-eval,job_type=eval,entity=llava-vl
+```
+
+**Evaluation with SGLang for bigger models (llava-next-72b)**
+
+```bash
+python3 -m lmms_eval \
+    --model=llava_sglang \
+    --model_args=pretrained=lmms-lab/llava-next-72b,tokenizer=lmms-lab/llavanext-qwen-tokenizer,conv_template=chatml-llava,tp_size=8,parallel=8 \
+    --tasks=mme \
+    --batch_size=1 \
+    --log_samples \
+    --log_samples_suffix=llava_qwen \
+    --output_path=./logs/ \
+    --verbosity=INFO
+```

## Supported models

-- GPT4V (API, only generation-based evaluation)
-- LLaVA-v1.5/v1.6-7B/13B/34B (ppl-based, generation-based)
-- Qwen-VL series (ppl-based, generation-based)
-- Fuyu series (ppl-based, generation-based)
-- InstructBLIP series (generation-based)
-
-## Supported datasets
-> () indicates the task name in the lmms_eval. The task name is also used to specify the dataset in the configuration file.
- -- AI2D (ai2d) -- ChartQA (chartqa) -- CMMMU (cmmmu) - - CMMMU Validation (cmmmu_val) - - CMMMU Test (cmmmu_test) -- COCO Caption (coco_cap) - - COCO 2014 Caption (coco2014_cap) - - COCO 2014 Caption Validation (coco2014_cap_val) - - COCO 2014 Caption Test (coco2014_cap_test) - - COCO 2017 Caption (coco2017_cap) - - COCO 2017 Caption MiniVal (coco2017_cap_val) - - COCO 2017 Caption MiniTest (coco2017_cap_test) -- [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench) -- DOCVQA (docvqa) - - DOCVQA Validation (docvqa_val) - - DOCVQA Test (docvqa_test) -- Ferret (ferret) -- Flickr30K (flickr30k) - - Ferret Test (ferret_test) -- GQA (gqa) -- HallusionBenchmark (hallusion_bench_image) -- Infographic VQA (info_vqa) - - Infographic VQA Validation (info_vqa_val) - - Infographic VQA Test (info_vqa_test) -- LLaVA-Bench (llava_in_the_wild) -- LLaVA-Bench-COCO (llava_bench_coco) -- MathVerse (mathverse) - - MathVerse Text Dominant (mathverse_testmini_text_dominant) - - MathVerse Text Only (mathverse_testmini_text_only) - - MathVerse Text Lite (mathverse_testmini_text_lite) - - MathVerse Vision Dominant (mathverse_testmini_vision_dominant) - - MathVerse Vision Intensive (mathverse_testmini_vision_intensive) - - MathVerse Vision Only (mathverse_testmini_vision_only) -- MathVista (mathvista) - - MathVista Validation (mathvista_testmini) - - MathVista Test (mathvista_test) -- MMBench (mmbench) - - MMBench English (mmbench_en) - - MMBench English Dev (mmbench_en_dev) - - MMBench English Test (mmbench_en_test) - - MMBench Chinese (mmbench_cn) - - MMBench Chinese Dev (mmbench_cn_dev) - - MMBench Chinese Test (mmbench_cn_test) -- MME (mme) -- MMMU (mmmu) - - MMMU Validation (mmmu_val) - - MMMU Test (mmmu_test) -- MMUPD (mmupd) - - MMUPD Base (mmupd_base) - - MMAAD Base (mmaad_base) - - MMIASD Base (mmiasd_base) - - MMIVQD Base (mmivqd_base) - - MMUPD Option (mmupd_option) - - MMAAD Option (mmaad_option) - - MMIASD Option (mmiasd_option) - - MMIVQD Option (mmivqd_option) - - MMUPD Instruction (mmupd_instruction) - - MMAAD Instruction (mmaad_instruction) - - MMIASD Instruction (mmiasd_instruction) - - MMIVQD Instruction (mmivqd_instruction) -- MMVet (mmvet) -- Multi-DocVQA (multidocvqa) - - Multi-DocVQA Validation (multidocvqa_val) - - Multi-DocVQA Test (multidocvqa_test) -- NoCaps (nocaps) - - NoCaps Validation (nocaps_val) - - NoCaps Test (nocaps_test) -- OKVQA (ok_vqa) - - OKVQA Validation 2014 (ok_vqa_val2014) -- POPE (pope) -- RefCOCO (refcoco) - - refcoco_seg - - refcoco_seg_test - - refcoco_seg_val - - refcoco_seg_testA - - refcoco_seg_testB - - refcoco_bbox - - refcoco_bbox_test - - refcoco_bbox_val - - refcoco_bbox_testA - - refcoco_bbox_testB - - refcoco_bbox_rec - - refcoco_bbox_rec_test - - refcoco_bbox_rec_val - - refcoco_bbox_rec_testA - - refcoco_bbox_rec_testB -- RefCOCO+ (refcoco+) - - refcoco+_seg - - refcoco+_seg_val - - refcoco+_seg_testA - - refcoco+_seg_testB - - refcoco+_bbox - - refcoco+_bbox_val - - refcoco+_bbox_testA - - refcoco+_bbox_testB - - refcoco+_bbox_rec - - refcoco+_bbox_rec_val - - refcoco+_bbox_rec_testA - - refcoco+_bbox_rec_testB -- RefCOCOg (refcocog) - - refcocog_seg - - refcocog_seg_test - - refcocog_seg_val - - refcocog_bbox - - refcocog_bbox_test - - refcocog_bbox_val - - refcocog_bbox_rec - - refcocog_bbox_rec_test - - refcocog_bbox_rec_val -- ScienceQA (scienceqa_full) - - ScienceQA Full (scienceqa) - - ScienceQA IMG (scienceqa_img) -- ScreenSpot (screenspot) - - ScreenSpot REC / Grounding (screenspot_rec) - - ScreenSpot REG / Instruction 
Generation (screenspot_reg)
-- SeedBench (seedbench)
-- SeedBench 2 (seedbench_2)
-- ST-VQA (stvqa)
-- TextCaps (textcaps)
-  - TextCaps Validation (textcaps_val)
-  - TextCaps Test (textcaps_test)
-- TextVQA (textvqa)
-  - TextVQA Validation (textvqa_val)
-  - TextVQA Test (textvqa_test)
-- VizWizVQA (vizwiz_vqa)
-  - VizWizVQA Validation (vizwiz_vqa_val)
-  - VizWizVQA Test (vizwiz_vqa_test)
-- VQAv2 (vqav2)
-  - VQAv2 Validation (vqav2_val)
-  - VQAv2 Test (vqav2_test)
-- WebSRC (websrc)
-  - WebSRC Validation (websrc_val)
-  - WebSRC Test (websrc_test)
-
-## Datasets to be added and tested
-- TallyQA (tallyqa)
-- VSR (vsr)
-- Winoground (winoground)
-- NLVR2 (nlvr2)
-- RavenIQ-Test (raveniq)
-- IconQA (iconqa)
-- VistBench (vistbench)
+Please check [supported models](lmms_eval/models/__init__.py) for more details.
+
+## Supported tasks
+
+Please check [supported tasks](docs/current_tasks.md) for more details.

# Add Customized Model and Dataset

@@ -302,14 +207,43 @@ Please refer to our [documentation](docs/README.md).

lmms_eval is a fork of [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). We recommend you read through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) for relevant information.

+---
+
Below are the changes we made to the original API:

- Build context now only passes in the doc `idx`; the image and doc are processed during the model responding phase. This is because datasets now contain lots of images and we can't store them in the doc like the original lm-eval-harness does, otherwise the CPU memory would explode.
- Instance.args (lmms_eval/api/instance.py) now contains a list of images to be input to the lmms (see the short sketch below).
- lm-eval-harness supports all HF language models with a single model class. This is currently not possible for lmms because the input/output formats of lmms in HF are not yet unified. Therefore, we have to create a new class for each lmms model. This is not ideal and we will try to unify them in the future.

-We also thank:
+---
+
+During the initial stage of our project, we thank:
- [Xiang Yue](https://xiangyue9607.github.io/), [Jingkang Yang](https://jingkang50.github.io/), [Dong Guo](https://www.linkedin.com/in/dongguoset/) and [Sheng Shen](https://sincerass.github.io/) for early discussion and testing.
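To make the API changes above concrete, here is a minimal sketch of how a `generate_until` request is assembled, mirroring the argument tuple built by `construct_requests` in `lmms_eval/api/task.py` as shown later in this diff. The prompt text, task/split names, and the `doc_to_visual` helper are illustrative placeholders rather than repository code:

```python
# Illustrative sketch of a generate_until request. The argument tuple follows
# (ctx, generation_kwargs, doc_to_visual, doc_id, task, split), so visuals are
# resolved lazily via doc_to_visual instead of being stored inside the doc.
from lmms_eval.api.instance import Instance

def doc_to_visual(doc):
    # Hypothetical helper: returns the list of images for a given doc.
    return [doc["image"]]

ctx = "What is shown in the image?"                      # prompt built from the doc
gen_kwargs = {"max_new_tokens": 16, "do_sample": False}  # task's generation_kwargs

request = Instance(
    request_type="generate_until",
    arguments=(ctx, gen_kwargs, doc_to_visual, 0, "mme", "test"),
    idx=0,
)  # the evaluator additionally forwards bookkeeping kwargs such as task metadata
```

Models registered under `lmms_eval/models` then call `doc_to_visual(...)` themselves while generating responses, which is why images never have to be serialized into the docs.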
+--- + +During the `v0.1` to `v0.2`, we thank the community support from pull requests (PRs): + +> Details are in [lmms-eval/v0.2.0 release notes](https://github.com/EvolvingLMMs-Lab/lmms-eval/releases/tag/untagged-9057ff0e9a72d5a5846f) + +**Datasets:** + +- VCR: Vision_Caption_Restoration (officially from the authors, MILA) +- ConBench (officially from the authors, PKU/Bytedance) +- MathVerse (officially from the authors, CUHK) +- MM-UPD (officially from the authors, University of Tokyo) +- Multi-lingual MMMU (officially from the authors, CUHK) +- WebSRC (from Hunter Heiden) +- ScreeSpot (from Hunter Heiden) +- RealworldQA (from Fanyi Pu, NTU) +- Multi-lingual LLaVA-W (from Gagan Bhatia, UBC) + +**Models:** + +- LLaVA-HF (officially from Huggingface) +- Idefics-2 (from the lmms-lab team) +- microsoft/Phi-3-Vision (officially from the authors, Microsoft) +- LLaVA-SGlang (from the lams-lab team) + ## Citations ```shell diff --git a/docs/README.md b/docs/README.md old mode 100644 new mode 100755 diff --git a/docs/commands.md b/docs/commands.md old mode 100644 new mode 100755 diff --git a/docs/current_tasks.md b/docs/current_tasks.md new file mode 100644 index 00000000..1622e960 --- /dev/null +++ b/docs/current_tasks.md @@ -0,0 +1,122 @@ +# Current Tasks + +> () indicates the task name in the lmms_eval. The task name is also used to specify the dataset in the configuration file. +> The following is manually updated documentation. You could use `lmms_eval task --list` to list all supported tasks and their task names. + +- AI2D (ai2d) +- ChartQA (chartqa) +- CMMMU (cmmmu) + - CMMMU Validation (cmmmu_val) + - CMMMU Test (cmmmu_test) +- COCO Caption (coco_cap) + - COCO 2014 Caption (coco2014_cap) + - COCO 2014 Caption Validation (coco2014_cap_val) + - COCO 2014 Caption Test (coco2014_cap_test) + - COCO 2017 Caption (coco2017_cap) + - COCO 2017 Caption MiniVal (coco2017_cap_val) + - COCO 2017 Caption MiniTest (coco2017_cap_test) +- [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench) +- DOCVQA (docvqa) + - DOCVQA Validation (docvqa_val) + - DOCVQA Test (docvqa_test) +- Ferret (ferret) +- Flickr30K (flickr30k) + - Ferret Test (ferret_test) +- GQA (gqa) +- HallusionBenchmark (hallusion_bench_image) +- Infographic VQA (info_vqa) + - Infographic VQA Validation (info_vqa_val) + - Infographic VQA Test (info_vqa_test) +- LLaVA-Bench (llava_in_the_wild) +- LLaVA-Bench-COCO (llava_bench_coco) +- MathVerse (mathverse) + - MathVerse Text Dominant (mathverse_testmini_text_dominant) + - MathVerse Text Only (mathverse_testmini_text_only) + - MathVerse Text Lite (mathverse_testmini_text_lite) + - MathVerse Vision Dominant (mathverse_testmini_vision_dominant) + - MathVerse Vision Intensive (mathverse_testmini_vision_intensive) + - MathVerse Vision Only (mathverse_testmini_vision_only) +- MathVista (mathvista) + - MathVista Validation (mathvista_testmini) + - MathVista Test (mathvista_test) +- MMBench (mmbench) + - MMBench English (mmbench_en) + - MMBench English Dev (mmbench_en_dev) + - MMBench English Test (mmbench_en_test) + - MMBench Chinese (mmbench_cn) + - MMBench Chinese Dev (mmbench_cn_dev) + - MMBench Chinese Test (mmbench_cn_test) +- MME (mme) +- MMMU (mmmu) + - MMMU Validation (mmmu_val) + - MMMU Test (mmmu_test) +- MMUPD (mmupd) + - MMUPD Base (mmupd_base) + - MMAAD Base (mmaad_base) + - MMIASD Base (mmiasd_base) + - MMIVQD Base (mmivqd_base) + - MMUPD Option (mmupd_option) + - MMAAD Option (mmaad_option) + - MMIASD Option (mmiasd_option) + - MMIVQD Option (mmivqd_option) + - 
MMUPD Instruction (mmupd_instruction) + - MMAAD Instruction (mmaad_instruction) + - MMIASD Instruction (mmiasd_instruction) + - MMIVQD Instruction (mmivqd_instruction) +- MMVet (mmvet) +- Multi-DocVQA (multidocvqa) + - Multi-DocVQA Validation (multidocvqa_val) + - Multi-DocVQA Test (multidocvqa_test) +- NoCaps (nocaps) + - NoCaps Validation (nocaps_val) + - NoCaps Test (nocaps_test) +- OKVQA (ok_vqa) + - OKVQA Validation 2014 (ok_vqa_val2014) +- POPE (pope) +- RefCOCO (refcoco) + - refcoco_seg_test + - refcoco_seg_val + - refcoco_seg_testA + - refcoco_seg_testB + - refcoco_bbox_test + - refcoco_bbox_val + - refcoco_bbox_testA + - refcoco_bbox_testB +- RefCOCO+ (refcoco+) + - refcoco+_seg + - refcoco+_seg_val + - refcoco+_seg_testA + - refcoco+_seg_testB + - refcoco+_bbox + - refcoco+_bbox_val + - refcoco+_bbox_testA + - refcoco+_bbox_testB +- RefCOCOg (refcocog) + - refcocog_seg_test + - refcocog_seg_val + - refcocog_bbox_test + - refcocog_bbox_val +- ScienceQA (scienceqa_full) + - ScienceQA Full (scienceqa) + - ScienceQA IMG (scienceqa_img) +- ScreenSpot (screenspot) + - ScreenSpot REC / Grounding (screenspot_rec) + - ScreenSpot REG / Instruction Generation (screenspot_reg) +- SeedBench (seedbench) +- SeedBench 2 (seedbench_2) +- ST-VQA (stvqa) +- TextCaps (textcaps) + - TextCaps Validation (textcaps_val) + - TextCaps Test (textcaps_test) +- TextVQA (textvqa) + - TextVQA Validation (textvqa_val) + - TextVQA Test (textvqa_test) +- VizWizVQA (vizwiz_vqa) + - VizWizVQA Validation (vizwiz_vqa_val) + - VizWizVQA Test (vizwiz_vqa_test) +- VQAv2 (vqav2) + - VQAv2 Validation (vqav2_val) + - VQAv2 Test (vqav2_test) +- WebSRC (websrc) + - WebSRC Validation (websrc_val) + - WebSRC Test (websrc_test) \ No newline at end of file diff --git a/docs/model_guide.md b/docs/model_guide.md old mode 100644 new mode 100755 diff --git a/docs/task_guide.md b/docs/task_guide.md old mode 100644 new mode 100755 index 31fb443d..1376bc22 --- a/docs/task_guide.md +++ b/docs/task_guide.md @@ -27,7 +27,7 @@ doc_to_target: "answer" generation_kwargs: max_new_tokens: 16 temperature: 0 - top_p: 0 + top_p: 1.0 num_beams: 1 do_sample: false # The return value of process_results will be used by metrics diff --git a/example_eval.yaml b/example_eval.yaml deleted file mode 100644 index 40e29a85..00000000 --- a/example_eval.yaml +++ /dev/null @@ -1,15 +0,0 @@ -- model: llava - model_args: pretrained=liuhaotian/llava-v1.5-7b - tasks: ai2d - batch_size: 1 - log_samples: true - log_samples_suffix: eval_vizwiz_vqa - output_path: "./logs/" - -- model: llava - model_args: pretrained=liuhaotian/llava-v1.5-13b - tasks: mme - batch_size: 1 - log_samples: true - log_samples_suffix: mme - output_path: "./logs/" diff --git a/lmms_eval/__init__.py b/lmms_eval/__init__.py old mode 100644 new mode 100755 diff --git a/lmms_eval/__main__.py b/lmms_eval/__main__.py old mode 100644 new mode 100755 index c852d2f4..2949705f --- a/lmms_eval/__main__.py +++ b/lmms_eval/__main__.py @@ -106,9 +106,16 @@ def parse_eval_args() -> argparse.Namespace: parser.add_argument( "--log_samples_suffix", type=str, - default="", + default="model_outputs", help="Specify a suffix for the log_samples file name.", ) + parser.add_argument( + "--predict_only", + "-x", + action="store_true", + default=False, + help="Use with --log_samples. 
Only model outputs will be saved and metrics will not be evaluated.", + ) parser.add_argument( "--show_config", action="store_true", @@ -228,6 +235,10 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: initialize_tasks(args.verbosity) + if args.predict_only: + args.log_samples = True + if (args.log_samples or args.predict_only) and not args.output_path: + raise ValueError("Specify --output_path if providing --log_samples or --predict_only") if args.limit: eval_logger.warning(" --limit SHOULD ONLY BE USED FOR TESTING." "REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.") if args.include_path is not None: @@ -274,6 +285,10 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: # set datetime before evaluation datetime_str = utils.get_datetime_str(timezone=args.timezone) if args.output_path: + if args.log_samples_suffix and len(args.log_samples_suffix) > 15: + eval_logger.warning("The suffix for log_samples is too long. It is recommended to keep it under 15 characters.") + args.log_samples_suffix = args.log_samples_suffix[:5] + "..." + args.log_samples_suffix[-5:] + hash_input = f"{args.model_args}".encode("utf-8") hash_output = hashlib.sha256(hash_input).hexdigest()[:6] path = Path(args.output_path) @@ -296,6 +311,7 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: log_samples=args.log_samples, gen_kwargs=args.gen_kwargs, cli_args=args, + predict_only=args.predict_only, ) if results is not None: @@ -318,9 +334,9 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: for task_name, config in results["configs"].items(): filename = args.output_path.joinpath(f"{task_name}.json") # Structure the data with 'args' and 'logs' keys - data_to_dump = {"args": vars(args), "model_configs": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"])} # Convert Namespace to dict - samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable) - filename.open("w").write(samples_dumped) + data_to_dump = {"args": vars(args), "model_configs": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"]), "time": datetime_str} + samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable, ensure_ascii=False) + filename.open("w", encoding="utf-8").write(samples_dumped) eval_logger.info(f"Saved samples to {filename}") return results, samples diff --git a/lmms_eval/api/__init__.py b/lmms_eval/api/__init__.py old mode 100644 new mode 100755 diff --git a/lmms_eval/api/filter.py b/lmms_eval/api/filter.py old mode 100644 new mode 100755 diff --git a/lmms_eval/api/instance.py b/lmms_eval/api/instance.py old mode 100644 new mode 100755 diff --git a/lmms_eval/api/metrics.py b/lmms_eval/api/metrics.py old mode 100644 new mode 100755 index 67958f51..c0e5c505 --- a/lmms_eval/api/metrics.py +++ b/lmms_eval/api/metrics.py @@ -16,6 +16,11 @@ # Register Aggregations First +@register_aggregation("bypass") +def bypass_agg(arr): + return 999 + + @register_aggregation("mean") def mean(arr): return sum(arr) / len(arr) @@ -226,6 +231,16 @@ def mean_stderr(arr): return sample_stddev(arr) / math.sqrt(len(arr)) +@register_metric( + metric="bypass", + higher_is_better=True, + output_type=["loglikelihood", "multiple_choice", "generate_until"], + aggregation="bypass", +) +def bypass(items): + return items + + @register_metric( metric="mcc", higher_is_better=True, diff --git a/lmms_eval/api/model.py b/lmms_eval/api/model.py old mode 100644 new mode 100755 diff 
--git a/lmms_eval/api/registry.py b/lmms_eval/api/registry.py old mode 100644 new mode 100755 index 0728b86d..253341db --- a/lmms_eval/api/registry.py +++ b/lmms_eval/api/registry.py @@ -1,6 +1,8 @@ from lmms_eval.api.model import lmms +from typing import Callable, Dict import logging +import evaluate as hf_evaluate eval_logger = logging.getLogger("lmms-eval") @@ -104,6 +106,22 @@ def decorate(fn): return decorate +def get_metric(name: str, hf_evaluate_metric=False) -> Callable: + if not hf_evaluate_metric: + if name in METRIC_REGISTRY: + return METRIC_REGISTRY[name] + else: + eval_logger.warning(f"Could not find registered metric '{name}' in lm-eval, searching in HF Evaluate library...") + + try: + metric_object = hf_evaluate.load(name) + return metric_object.compute + except Exception: + eval_logger.error( + f"{name} not found in the evaluate library! Please check https://huggingface.co/evaluate-metric", + ) + + def register_aggregation(name): def decorate(fn): assert name not in AGGREGATION_REGISTRY, f"aggregation named '{name}' conflicts with existing registered aggregation!" diff --git a/lmms_eval/api/samplers.py b/lmms_eval/api/samplers.py old mode 100644 new mode 100755 diff --git a/lmms_eval/api/task.py b/lmms_eval/api/task.py old mode 100644 new mode 100755 index 0a58d981..c035a0a2 --- a/lmms_eval/api/task.py +++ b/lmms_eval/api/task.py @@ -1,45 +1,41 @@ import abc -from dataclasses import dataclass, field, asdict - -import itertools -import os -import re import ast +import itertools +import json import logging +import os import random -from tqdm import tqdm +import re +import shutil +import subprocess +from collections.abc import Callable +from dataclasses import dataclass, field, asdict +from glob import glob +from typing import Any, List, Union import datasets -from datasets import Image, Sequence import numpy as np from PIL import ImageFile +from datasets import DownloadConfig, Image, Sequence +from huggingface_hub import snapshot_download +from tenacity import retry, stop_after_attempt, wait_fixed, stop_after_delay +from tqdm import tqdm -from datasets import DownloadConfig -from typing import Union, List, Any -from collections.abc import Callable -from tenacity import retry, stop_after_attempt, wait_fixed - +from accelerate import Accelerator from lmms_eval import utils from lmms_eval.api import samplers from lmms_eval.api.instance import Instance - -from lmms_eval.filters import build_filter_ensemble from lmms_eval.api.registry import ( - get_aggregation, - get_metric_aggregation, - is_higher_better, + AGGREGATION_REGISTRY, DEFAULT_METRIC_REGISTRY, METRIC_REGISTRY, OUTPUT_TYPE_REGISTRY, - AGGREGATION_REGISTRY, + get_aggregation, + get_metric, + get_metric_aggregation, + is_higher_better, ) - -ALL_OUTPUT_TYPES = [ - "loglikelihood", - "multiple_choice", - "generate_until", -] - +from lmms_eval.filters import build_filter_ensemble eval_logger = logging.getLogger("lmms-eval") @@ -47,6 +43,12 @@ # Include this inside code block to avoid error ImageFile.LOAD_TRUNCATED_IMAGES = True +ALL_OUTPUT_TYPES = [ + "loglikelihood", + "multiple_choice", + "generate_until", +] + @dataclass class TaskConfig(dict): @@ -100,7 +102,7 @@ def __post_init__(self) -> None: import inspect from importlib import import_module - self.dataset_path = inspect.getfile(import_module(self.dataset_path)) + # self.dataset_path = inspect.getfile(import_module(self.dataset_path)) if self.generation_kwargs is not None: if self.output_type != "generate_until": @@ -508,6 +510,29 @@ def dump_config(self) -> dict: # 
(num_fewshot) return self.config.to_dict() + def override_metric(self, metric_name: str) -> None: + """ + Override the default metrics used for evaluation with custom metrics. + + Parameters: + - metric_name (str): The name of the custom metric to override. Should be registered in api.metrics. + """ + ( + self._metric_fn_list, + self._aggregation_list, + self._metric_fn_kwargs, + self._higher_is_better, + ) = ({}, {}, {}, {}) + self._metric_fn_list[metric_name] = get_metric(metric_name) + self._aggregation_list[metric_name] = get_metric_aggregation(metric_name) + self._higher_is_better[metric_name] = is_higher_better(metric_name) + self._metric_fn_kwargs[metric_name] = {} + if not isinstance(self, ConfigurableTask): + self.process_results = lambda x, y: {metric_name: get_metric(metric_name)} + self.aggregation = lambda: {metric_name: get_metric_aggregation(metric_name)} + setattr(self._config, "metric_list", [{"metric": metric_name}]) + setattr(self._config, "process_results", None) + class ConfigurableTask(Task): VERSION = "Yaml" @@ -676,42 +701,127 @@ def _prepare_metric_and_aggregation(self): eval_logger.warning(f"[Task: {self._config.task}] metric {metric_name} is defined, but higher_is_better is not. " f"using default " f"higher_is_better={is_higher_better(metric_name)}") self._higher_is_better[metric_name] = is_higher_better(metric_name) - @retry(stop=stop_after_attempt(5), wait=wait_fixed(2)) + @retry(stop=(stop_after_attempt(5) | stop_after_delay(60)), wait=wait_fixed(2)) def download(self, dataset_kwargs=None) -> None: # If the dataset is a video dataset, # Recursively search whether their is a zip and unzip it to the huggingface home - if dataset_kwargs is not None and "video" in dataset_kwargs and dataset_kwargs["video"]: - hf_home = os.environ["HF_HOME"] - cache_dir = dataset_kwargs["cache_dir"] - dataset_kwargs.pop("cache_dir") - cache_dir = os.path.join(hf_home, cache_dir) - cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset") - zip_files = glob(os.path.join(cache_path, "**/*.zip"), recursive=True) - if not os.path.exists(cache_dir): - for zip_file in zip_files: - shutil.unpack_archive(zip_file, cache_dir) - builder_script = dataset_kwargs["builder_script"] - self.DATASET_PATH = os.path.join(cache_path, builder_script) - dataset_kwargs.pop("video") - dataset_kwargs.pop("builder_script") download_config = DownloadConfig() - download_config.max_retries = dataset_kwargs.get("max_retries", 3) if dataset_kwargs is not None else 3 + download_config.max_retries = dataset_kwargs.get("max_retries", 10) if dataset_kwargs is not None else 10 download_config.num_proc = dataset_kwargs.get("num_proc", 8) if dataset_kwargs is not None else 8 + download_config.local_files_only = dataset_kwargs.get("local_files_only", False) if dataset_kwargs is not None else False + if dataset_kwargs is not None: + if "From_YouTube" in dataset_kwargs: + + def _download_from_youtube(path): + try: + for video in tqdm(self.all_dataset[split]): + video_id = video["videoID"] + target_path = os.path.join(path, f"{video_id}.mp4") + assert shutil.which("yt-dlp") is not None, "yt-dlp must be installed and available in the system's PATH" + command = f"yt-dlp -o {target_path} -f mp4 https://www.youtube.com/watch?v={video_id}" + subprocess.run(command, shell=True) + with open(os.path.join(cache_path, f"{task}_download_status.json"), "w") as f: + f.write(json.dumps({task: "downloaded"})) + except Exception as e: + eval_logger.error(f"Error while downloading {task} data: {e}") + with 
open(os.path.join(cache_path, f"{task}_download_status.json"), "w") as f: + f.write(json.dumps({task: "not downloaded"})) + + hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/") + accelerator = Accelerator() + if accelerator.is_main_process: + dataset_kwargs.pop("From_YouTube") + self.all_dataset = datasets.load_dataset( + path=self.DATASET_PATH, + name=self.DATASET_NAME, + download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS, + **dataset_kwargs if dataset_kwargs is not None else {}, + ) + dataset_kwargs["From_YouTube"] = True + cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset") # download_parquet + split = vars(self.config)["test_split"] + task = vars(self.config)["task"] + + video_path = os.path.join(hf_home, task) + if os.path.exists(os.path.join(cache_path, f"{task}_download_status.json")): + download_status = json.load(open(os.path.join(cache_path, f"{task}_download_status.json"), "r")) + if download_status[task] == "downloaded": + eval_logger.info(f"Data for {task} already download!") + else: + eval_logger.info(f"Start downloading YouTube data to {video_path}...") + _download_from_youtube(video_path) + else: + eval_logger.info(f"Start downloading YouTube data to {video_path}...") + _download_from_youtube(video_path) + + accelerator.wait_for_everyone() + if "builder_script" in dataset_kwargs: + builder_script = dataset_kwargs["builder_script"] + self.DATASET_PATH = os.path.join(cache_path, builder_script) + dataset_kwargs.pop("builder_script") + + downloaded_video_ids = [i.split(".mp4")[0] for i in os.listdir(os.path.expanduser(video_path)) if i.endswith(".mp4")] + # Filtered the existing dataset with the downloaded video ids + self.dataset = datasets.DatasetDict({split: self.all_dataset[split].filter(lambda x: x["videoID"] in downloaded_video_ids)}) + + self.dataset_no_image = self.dataset + dataset_kwargs.pop("From_YouTube") + return + + if "video" in dataset_kwargs and dataset_kwargs["video"]: + hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/") + cache_dir = dataset_kwargs["cache_dir"] + cache_dir = os.path.join(hf_home, cache_dir) + accelerator = Accelerator() + if accelerator.is_main_process: + force_download = dataset_kwargs.get("force_download", False) + force_unzip = dataset_kwargs.get("force_unzip", False) + cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset", force_download=force_download, etag_timeout=60) + zip_files = glob(os.path.join(cache_path, "**/*.zip"), recursive=True) + + def unzip_video_data(zip_file): + import zipfile + + with zipfile.ZipFile(zip_file, "r") as zip_ref: + zip_ref.extractall(cache_dir) + eval_logger.info(f"Extracted all files from {zip_file} to {cache_dir}") + + if force_unzip or (not os.path.exists(cache_dir) and len(zip_files) > 0): + for zip_file in zip_files: + unzip_video_data(zip_file) + + accelerator.wait_for_everyone() + dataset_kwargs.pop("cache_dir") + dataset_kwargs.pop("video") + + if "builder_script" in dataset_kwargs: + builder_script = dataset_kwargs["builder_script"] + self.DATASET_PATH = os.path.join(cache_path, builder_script) + dataset_kwargs.pop("builder_script") + + if "force_download" in dataset_kwargs: + dataset_kwargs.pop("force_download") + + if "force_unzip" in dataset_kwargs: + dataset_kwargs.pop("force_unzip") + + if "local_files_only" in dataset_kwargs: + dataset_kwargs.pop("local_files_only") + self.dataset = datasets.load_dataset( path=self.DATASET_PATH, name=self.DATASET_NAME, download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS, + 
download_config=download_config, + **dataset_kwargs if dataset_kwargs is not None else {}, + ) + self.dataset_no_image = datasets.load_dataset( + path=self.DATASET_PATH, + name=self.DATASET_NAME, + download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS, + download_config=download_config, **dataset_kwargs if dataset_kwargs is not None else {}, ) - if self.config.process_docs is not None: - for split in self.dataset: - if split in [ - self.config.training_split, self.config.validation_split, self.config.test_split, self.config.fewshot_split - ]: - self.dataset[split] = self.config.process_docs(self.dataset[split]) - - # copy dataset, remove image features - self.dataset_no_image = self.dataset.copy() for doc_name in self.dataset_no_image: remove_cols = [] features = self.dataset_no_image[doc_name].features @@ -744,14 +854,20 @@ def has_test_docs(self) -> bool: def training_docs(self) -> datasets.Dataset: if self.has_training_docs(): + if self.config.process_docs is not None: + return self.config.process_docs(self.dataset[self.config.training_split]) return self.dataset[self.config.training_split] def validation_docs(self) -> datasets.Dataset: if self.has_validation_docs(): + if self.config.process_docs is not None: + return self.config.process_docs(self.dataset[self.config.validation_split]) return self.dataset[self.config.validation_split] def test_docs(self) -> datasets.Dataset: if self.has_test_docs(): + if self.config.process_docs is not None: + return self.config.process_docs(self.dataset[self.config.test_split]) return self.dataset[self.config.test_split] def fewshot_docs(self): @@ -985,11 +1101,17 @@ def construct_requests(self, doc_id: int, ctx: str, **kwargs) -> Union[List[Inst arguments = (ctx, self.config.generation_kwargs, self.doc_to_visual, doc_id, self.config.task, split) return Instance(request_type=self.OUTPUT_TYPE, arguments=arguments, idx=0, **kwargs) - def process_results(self, doc, results): + # TODO: we add a full_docs interface here for some evaluations that needs to access the full datasets during process_results function. we may have better ways to handle this. + @retry(stop=(stop_after_attempt(5) | stop_after_delay(1200)), wait=wait_fixed(2)) + def process_results(self, doc, results, full_docs=None): if self.OUTPUT_TYPE == "generate_until": results[0] = results[0].strip() + + kwargs = {} + if full_docs is not None: + kwargs["full_docs"] = full_docs if callable(self.config.process_results): - return self.config.process_results(doc, results) + return self.config.process_results(doc, results, **kwargs) result_dict = {} use_metric = list(self._metric_fn_list.keys()) diff --git a/lmms_eval/evaluator.py b/lmms_eval/evaluator.py old mode 100644 new mode 100755 index a97edff0..8a4c49d8 --- a/lmms_eval/evaluator.py +++ b/lmms_eval/evaluator.py @@ -17,6 +17,8 @@ import lmms_eval.api.metrics import lmms_eval.api.registry +import re + from lmms_eval.utils import ( positional_deprecated, run_task_tests, @@ -44,6 +46,7 @@ def simple_evaluate( log_samples: bool = True, gen_kwargs: str = None, cli_args=None, # Bo: put args into more functions (cost 48 Bytes per call) + predict_only: bool = False, ): """Instantiate and evaluate a model on a list of tasks. @@ -111,6 +114,12 @@ def simple_evaluate( if config["output_type"] == "generate_until" and gen_kwargs: config["generation_kwargs"].update(gen_kwargs) + if predict_only: + log_samples = True + eval_logger.info(f"Processing {task_name} in output-only mode. 
Metrics will not be calculated!") + # we have to change the class properties post-hoc. This is pretty hacky. + task_obj.override_metric(metric_name="bypass") + if num_fewshot is not None: if config["num_fewshot"] == 0: eval_logger.info(f"num_fewshot has been set to 0 for {task_name} in its config. Manual configuration will be ignored.") @@ -285,7 +294,7 @@ def evaluate( cloned_reqs.extend([req] * req.repeats) # run requests through model - resps = getattr(lm, reqtype)(cloned_reqs) + resps = getattr(lm, reqtype)(cloned_reqs) # Choiszt run generate until # put responses from model into a list of length K for each request. for x, req in zip(resps, cloned_reqs): @@ -318,7 +327,7 @@ def evaluate( # hack: remove image columns to speed avoid loading images and speed up postprocessing # reason: doc_iterator will actually load image if it's in the doc. docs = task.test_docs() if task.has_test_docs() else task.validation_docs() - if "d170" not in task_name and "dc100" not in task_name and "dc200" not in task_name: + if "d170" not in task_name and "dc100" not in task_name and "dc200" not in task_name and "llava_wilder" not in task_name and "livebench" not in task_name: remove_cols = [] features = docs.features # If it is an Image instance or a Sequence of Image instance. Remove it @@ -329,6 +338,13 @@ def evaluate( remove_cols.append(feature) if remove_cols: docs = docs.remove_columns(remove_cols) + + ####################### Processing with Full Docs Mode ####################### + if task_name in ["videochatgpt_consistency"]: + full_docs = True + else: + full_docs = False + doc_iterator = itertools.islice(enumerate(docs), lm.rank, limit, lm.world_size) # Instead of converting the iterator to a list, use `itertools.tee` to create a parallel iterator for counting # doc_iterator, doc_iterator_for_counting = itertools.tee(doc_iterator) @@ -340,7 +356,10 @@ def evaluate( # subset instances to only this document id ; sort by idx requests = list(filter(lambda x: x.doc_id == doc_id, task.instances)) requests.sort(key=lambda x: x.idx) - metrics = task.process_results(doc, [req.filtered_resps[key] for req in requests]) + if full_docs: + metrics = task.process_results(doc, [req.filtered_resps[key] for req in requests], full_docs=docs) + else: + metrics = task.process_results(doc, [req.filtered_resps[key] for req in requests]) if log_samples: target = task.doc_to_target(doc) example = { @@ -403,6 +422,8 @@ def evaluate( vals_torch[(task_name, key, metric)] = gathered_item vals = vals_torch + # Ensure all ranks wait for rank 0 to finish aggregation + torch.distributed.barrier() if lm.rank == 0: ### Get task ordering for correct sample-wide aggregation @@ -502,11 +523,22 @@ def evaluate( continue if metric in results[group]: - results[group][metric] = (results[group][metric] * total_size + metric_score * current_size) / (total_size + current_size) - # $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$ - results[group][stderr] = ((total_size - 1) * results[group][stderr] + (current_size - 1) * var_score) / (total_size + current_size - 1) + total_size * current_size / ( - (total_size + current_size) * (total_size + current_size - 1) - ) * (results[group][metric] - metric_score) ** 2 + if isinstance(results[group][metric], str) == False: + results[group][metric] = (results[group][metric] * total_size + metric_score * current_size) / (total_size + current_size) + # $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$ + 
results[group][stderr] = ((total_size - 1) * results[group][stderr] + (current_size - 1) * var_score) / (total_size + current_size - 1) + total_size * current_size / ( + (total_size + current_size) * (total_size + current_size - 1) + ) * (results[group][metric] - metric_score) ** 2 + else: + # accuracy = re.search(r'acc: ([\d.]+)%', results[group][metric]).group(1) + # score = re.search(r'score: ([\d.]+)', results[group][metric]).group(1) + # group_accuracy = float(accuracy) + # group_score = float(score) + # group_accuracy = (group_accuracy * total_size + metric_score * current_size) / total_size + # group_score = (group_score * total_size + metric_score * current_size) / total_size + # results[group][metric] = "Acc: " + str(group_accuracy) + " Score: " + str(group_score) + results[group][metric] = "group_results" + results[group][stderr] = 0 else: results[group][metric] = metric_score results[group][stderr] = var_score diff --git a/lmms_eval/filters/__init__.py b/lmms_eval/filters/__init__.py old mode 100644 new mode 100755 diff --git a/lmms_eval/filters/decontamination.py b/lmms_eval/filters/decontamination.py old mode 100644 new mode 100755 diff --git a/lmms_eval/filters/extraction.py b/lmms_eval/filters/extraction.py old mode 100644 new mode 100755 index 329d7540..f3045673 --- a/lmms_eval/filters/extraction.py +++ b/lmms_eval/filters/extraction.py @@ -212,3 +212,67 @@ def find_match(self, regex, resp, convert_dict={}): if match and match in convert_dict: match = convert_dict[match] return match + + +# Designed for the AI2D/RealworldQA dataset +class SimpleMultiChoiceRegexFilter(ExtendedRegexFilter): + def __init__(self, *args, **kwargs): + """ + regex_pattern: The basic regex pattern to use. If fails to match, we will use the customized match procedure + - step 1 : We parse the choices between ([A-Z])s then try to find these choices in the response. + - step 2 : We parse the choice with regex :[\s]*([A-?]), where ? varies by number of choices. + group_select: Selects the (group_select)th match from the findall result. + ignore_case: Ignores the case during step 1 matching + ignore_punctuation: Remove the punctuation during step 1 matching + regexes_to_ignore: Remove these regexes during step 1 matching + """ + super().__init__(*args, **kwargs) + + def apply(self, resps, docs): + # here, we assume we have a list, in which each element is + # a list of model responses for some particular input/target pair. + # so we process each of these (same input/target response sets) + # independently (and keep them a list.) 
+ + filtered_resps = [] + + for r, doc in zip(resps, docs): + fallback_regexes = [] + choice_to_alpha = {} + next_alpha = "A" + + without_paren_fallback_regexes = [] + without_paren_to_target = {} + + # Regex to extract multiple choice options from the question + multiple_choices_regex = re.compile(r"\b([A-Z])\.\s+([^\n]*)") + matches = multiple_choices_regex.findall(doc["question"]) + + # Build regex patterns and mappings for each choice + for m in matches: + choice_text = m[1].strip() + fallback_regexes.append(f"{re.escape(choice_text)}") + choice_to_alpha[choice_text] = next_alpha + + next_alpha = chr(ord(next_alpha) + 1) + + # Compile regex to match any of the extracted choices + fallback_regex = re.compile("|".join(fallback_regexes)) + + # Process each response + filtered = [] + for resp in r: + # Remove any punctuation and extra spaces + cleaned_resp = re.sub(r"[^\w\s]", "", resp).strip() + # Try to match cleaned response with the choice text + match = fallback_regex.search(cleaned_resp) + if match and match.group() in choice_to_alpha: + # Map the matched choice text back to its corresponding letter + filtered.append(choice_to_alpha[match.group()]) + else: + # If no match, return the cleaned response + filtered.append(cleaned_resp) + + filtered_resps.append(filtered[0]) + + return filtered_resps diff --git a/lmms_eval/filters/selection.py b/lmms_eval/filters/selection.py old mode 100644 new mode 100755 diff --git a/lmms_eval/filters/transformation.py b/lmms_eval/filters/transformation.py old mode 100644 new mode 100755 diff --git a/lmms_eval/logging_utils.py b/lmms_eval/logging_utils.py old mode 100644 new mode 100755 index 21a2ee04..6107d21b --- a/lmms_eval/logging_utils.py +++ b/lmms_eval/logging_utils.py @@ -89,10 +89,10 @@ def finish(self): def init_run(self): if "name" not in self.wandb_args: if "config" in self.all_args_dict and self.all_args_dict["config"] != "": - self.wandb_args["name"] = self.all_args_dict["config"].split("/")[-1].replace(".yaml", "") + "_" + self.args.log_samples_suffix + self.wandb_args["name"] = self.all_args_dict["config"].split("/")[-1].replace(".yaml", "") + "/" + self.args.log_samples_suffix else: task_names = self.args.tasks.replace(",", "/") - self.wandb_args["name"] = f"{self.args.model}_{task_names}_{self.args.log_samples_suffix}" + self.wandb_args["name"] = f"{self.args.model}/<{task_names}>/{self.args.log_samples_suffix}" if self.args.num_fewshot: self.wandb_args["name"] += f"_{self.args.num_fewshot}shot" if "project" not in self.wandb_args: @@ -119,6 +119,7 @@ def _get_config(self) -> Dict[str, Any]: def _sanitize_results_dict(self) -> Tuple[Dict[str, str], Dict[str, Any]]: """Sanitize the results dictionary.""" _results = copy.deepcopy(self.results.get("results", dict())) + _results["model_configs"] = self.results.get("model_configs", dict()) # Remove None from the metric string name tmp_results = copy.deepcopy(_results) @@ -138,15 +139,18 @@ def _sanitize_results_dict(self) -> Tuple[Dict[str, str], Dict[str, Any]]: if isinstance(metric_value, str): wandb_summary[f"{task}/{metric_name}"] = metric_value + wandb_summary["model_configs"] = self.results.get("model_configs", dict()) for summary_metric, summary_value in wandb_summary.items(): - _task, _summary_metric = summary_metric.split("/") - _results[_task].pop(_summary_metric) + if summary_metric != "model_configs": + _task, _summary_metric = summary_metric.split("/") + _results[_task].pop(_summary_metric) tmp_results = copy.deepcopy(_results) for task_name, task_results in 
tmp_results.items(): - for metric_name, metric_value in task_results.items(): - _results[f"{task_name}/{metric_name}"] = metric_value - _results[task_name].pop(metric_name) + if task_name != "model_configs": + for metric_name, metric_value in task_results.items(): + _results[f"{task_name}/{metric_name}"] = metric_value + _results[task_name].pop(metric_name) for task in self.task_names: _results.pop(task) diff --git a/lmms_eval/models/__init__.py b/lmms_eval/models/__init__.py old mode 100644 new mode 100755 index 5dbfc7ae..3fe74164 --- a/lmms_eval/models/__init__.py +++ b/lmms_eval/models/__init__.py @@ -1,16 +1,32 @@ import os +import hf_transfer + +os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1" AVAILABLE_MODELS = { "llava": "Llava", - "llava_hf": "LlavaHf", - "llava_sglang": "LlavaSglang", "qwen_vl": "Qwen_VL", "fuyu": "Fuyu", + "batch_gpt4": "BatchGPT4", "gpt4v": "GPT4V", "instructblip": "InstructBLIP", "minicpm_v": "MiniCPM_V", - "idefics2": "Idefics2", + "llava_vid": "LlavaVid", + "videoChatGPT": "VideoChatGPT", + "llama_vid": "LLaMAVid", + "video_llava": "VideoLLaVA", + "xcomposer2_4KHD": "XComposer2_4KHD", + "claude": "Claude", "qwen_vl_api": "Qwen_VL_API", + "llava_sglang": "LlavaSglang", + "idefics2": "Idefics2", + "internvl": "InternVLChat", + "gemini_api": "GeminiAPI", + "gemini_model": "GeminiModel", + "reka": "Reka", + "llava_onevision": "Llava_OneVision", + "from_log": "FromLog", + "mplug_owl_video": "mplug_Owl", "phi3v": "Phi3v", } @@ -19,8 +35,3 @@ exec(f"from .{model_name} import {model_class}") except ImportError: pass - - -import hf_transfer - -os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1" diff --git a/lmms_eval/models/batch_gpt4.py b/lmms_eval/models/batch_gpt4.py new file mode 100755 index 00000000..54bfa149 --- /dev/null +++ b/lmms_eval/models/batch_gpt4.py @@ -0,0 +1,205 @@ +# Standard library imports +from copy import deepcopy +from io import BytesIO +import base64 +import logging +import os +import time +import json + +# Related third-party imports +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +import numpy as np +from PIL import Image +import requests as url_requests +from tqdm import tqdm +from openai import OpenAI + +# Local application/library specific imports +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval import utils + +# Conditional imports +try: + from decord import VideoReader, cpu +except ImportError: + eval_logger = logging.getLogger("lmms-eval") + eval_logger.info("Decord is not installed. 
Video input will not be supported.") + +# Constants and global configurations +API_TYPE = os.getenv("API_TYPE", "openai") +NUM_SECONDS_TO_SLEEP = 5 + +if API_TYPE == "openai": + API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") + API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") + headers = { + "Authorization": f"Bearer {API_KEY}", + "Content-Type": "application/json", + } +elif API_TYPE == "azure": + API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken") + API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY") + headers = { + "api-key": API_KEY, + "Content-Type": "application/json", + } +else: + API_URL = "YOUR_API_URL" + API_KEY = "YOUR_API_KEY" + + +@register_model("batch_gpt4") +class BatchGPT4(lmms): + def __init__( + self, + model_version: str = "gpt-4o", + api_key: str = API_KEY, + api_url: str = API_URL, + modality: str = "image", + max_frames_for_video: int = 10, + timeout: int = 120, + **kwargs, + ) -> None: + super().__init__() + # Manually set a image token for GPT4V so that we can search for it + # and split the text and image + # Here we just use the same token as llava for convenient + self.model_version = model_version + self.modality = modality + self.max_frames_for_video = max_frames_for_video + self.image_token = "" + self.timeout = timeout + + self.api_key = api_key + self.api_url = api_url + self.client = OpenAI(api_key=api_key) + + accelerator = Accelerator() + assert accelerator.state.local_process_index == 0, "BatchGPT4 does not support distributed inference." + assert accelerator.state.num_processes == 1, "BatchGPT4 does not support distributed inference." + + # Function to encode the image + def encode_image(self, image: Image): + output_buffer = BytesIO() + image.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + return base64_str + + # Function to encode the video + def encode_video(self, video_path, for_get_frames_num): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, for_get_frames_num, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + frames = vr.get_batch(frame_idx).asnumpy() + + base64_frames = [] + for frame in frames: + img = Image.fromarray(frame) + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + base64_frames.append(base64_str) + + return base64_frames + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def generate_until(self, requests): + # Prepare the batch requests data + requests_data = {} + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Batch Preparing") + for idx, (contexts, gen_kwargs, doc_to_visual, doc_id, task, split) in enumerate([reg.args for reg in requests]): + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + imgs = [] + for visual in visuals: + if self.modality == "image": + img = self.encode_image(visual) + imgs.append(img) + elif self.modality == "video": + frames = self.encode_video(visual, self.max_frames_for_video) + imgs.extend(frames) + + messages = [] + if self.image_token not in contexts: + messages.append({"role": "user", "content": contexts}) + for img in imgs: + messages.append({"role": "user", "content": 
f"data:image/jpeg;base64,{img}"}) + else: + contexts_split = contexts.split(self.image_token) + for idx, context in enumerate(contexts_split): + if idx < len(imgs): + messages.append({"role": "user", "content": context}) + messages.append({"role": "user", "content": f"data:image/jpeg;base64,{imgs[idx]}"}) + if len(contexts_split) > len(imgs): + messages.append({"role": "user", "content": contexts_split[-1]}) + + requests_data[f"request-{idx}"] = {"model": self.model_version, "messages": messages, "max_tokens": gen_kwargs.get("max_new_tokens", 1024)} + pbar.update(1) + + file_path = os.getenv("HF_HOME", "~/.cache/huggingface") + f"/batchinput_{len(requests_data)}.jsonl" + file_path = self.create_batch_input_file(requests_data, file_path) + file_id = self.upload_input_file(file_path) + + batch_response = self.create_batch(file_id, metadata={"description": "Batch Processing for GPT-4"}) + batch_status = self.check_batch_status(batch_response.id) + while True: + batch_status = self.check_batch_status(batch_response.id) + if batch_status.status == "completed": + eval_logger.info("Batch processing completed.") + batch_results = self.retrieve_batch_results(batch_status.output_file_id) + res = [result["response"]["choices"][0]["message"]["content"] for result in json.loads(batch_results)] + return res + elif batch_status.status == "failed": + eval_logger.info("Batch processing failed.") + res = ["Batch failed"] * len(requests) + return res + else: + eval_logger.info(f"Batch status: {batch_status.status}. Retrying in {NUM_SECONDS_TO_SLEEP} seconds.") + time.sleep(NUM_SECONDS_TO_SLEEP) + + def loglikelihood(self, requests): + # TODO + assert False, "GPT4V not support" + + def create_batch_input_file(self, requests_data, file_path="batchinput.jsonl"): + with open(file_path, "w") as file: + for request_id, data in requests_data.items(): + json_record = json.dumps({"custom_id": request_id, "method": "POST", "url": "/v1/chat/completions", "body": data}) + file.write(json_record + "\n") + return file_path + + def upload_input_file(self, file_path): + with open(file_path, "rb") as file: + response = self.client.files.create(file=file, purpose="batch") + return response.id + + def create_batch(self, file_id, metadata=None): + if metadata is None: + metadata = {} + response = self.client.batches.create(input_file_id=file_id, endpoint="/v1/chat/completions", completion_window="24h", metadata=metadata) + return response + + def check_batch_status(self, batch_id): + return self.client.batches.retrieve(batch_id) + + def retrieve_batch_results(self, file_id): + return self.client.files.content(file_id) + + def cancel_batch(self, batch_id): + return self.client.batches.cancel(batch_id) + + def list_batches(self, limit=10): + return self.client.batches.list(limit=limit) diff --git a/lmms_eval/models/claude.py b/lmms_eval/models/claude.py new file mode 100644 index 00000000..c629ca06 --- /dev/null +++ b/lmms_eval/models/claude.py @@ -0,0 +1,256 @@ +from io import BytesIO +from copy import deepcopy +import os +import base64 +import json +from typing import List, Tuple, Union +from tqdm import tqdm +import requests as url_requests +import time +import logging + +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval import utils + +from accelerate import Accelerator, DistributedType + +from PIL import Image + +NUM_SECONDS_TO_SLEEP = 5 +eval_logger = logging.getLogger("lmms-eval") + +try: + import anthropic + from decord 
import VideoReader, cpu + import numpy as np +except Exception as e: + eval_logger.error(f"Error importing claude: {e}") + +API_URL = os.getenv("ANTHROPIC_API_URL", "https://api.anthropic.com/v1/complete") +API_KEY = os.getenv("ANTHROPIC_API_KEY", "YOUR_API_KEY") + + +@register_model("claude") +class Claude(lmms): + def __init__( + self, + model_version: str = "claude-3-opus-20240229", + image_token: str = "", # Use to separate interleaved image and text + system_prompt: str = "", # Whether you want some special system prompt here + modality: str = "image", + continual_mode: bool = False, + response_persistent_folder: str = None, + **kwargs, + ) -> None: + super().__init__() + self.model_version = model_version + self.image_token = image_token + self.system_prompt = system_prompt + self.modality = modality + + self.continual_mode = continual_mode + if self.continual_mode and response_persistent_folder is None: + raise ValueError("Continual mode requires a persistent path for the response. Please provide a valid path.") + self.response_persistent_folder = response_persistent_folder + self.response_persistent_file = os.path.join(self.response_persistent_folder, f"{self.model_version}_response.json") + + if os.path.exists(self.response_persistent_file): + with open(self.response_persistent_file, "r") as f: + self.response_cache = json.load(f) + self.cache_mode = "resume" + else: + self.response_cache = {} + self.cache_mode = "start" + + accelerator = Accelerator() + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device + + def encode_image(self, image): + output_buffer = BytesIO() + image.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + return base64_str + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def get_image_size(self, image): + # Create a BytesIO object to store the image bytes + img_byte_array = BytesIO() + + # Save the image to the BytesIO object + image.save(img_byte_array, format="PNG") + + # Get the size of the BytesIO object + img_size = img_byte_array.tell() + + return img_size + + # The max file size is 5MB for claude + def shrink_image_to_file_size(self, img: Image, max_file_size=4838990) -> Image: + # Get the current size of the image + original_size = self.get_image_size(img) + + # If the image size is already smaller than the desired size, return + if original_size <= max_file_size: + return img + + # Calculate the ratio to shrink the image + # Somehow I found out sqrt ratio is not enough to shrink the image + # below threshold, so I guess we do more + shrink_ratio = min(0.9, max_file_size / original_size) + + # Resize the image with the calculated ratio + new_width = int(img.width * shrink_ratio) + new_height = int(img.height * shrink_ratio) + img = img.resize((new_width, new_height), Image.LANCZOS) 
+ + return self.shrink_image_to_file_size(img, max_file_size) + + def encode_video(self, video_path): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, self.max_frames_for_video, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + frames = vr.get_batch(frame_idx).asnumpy() + + base64_frames = [] + for frame in frames: + img = Image.fromarray(frame) + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + base64_frames.append(base64_str) + + return base64_frames + + def generate_until(self, requests) -> List[str]: + client = anthropic.Anthropic() + + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + empty_image_block = { + "type": "image", + "source": { + "type": "base64", + "media_type": "image/png", + }, + } + empty_text_block = {"type": "text"} + empty_messages = [ + { + "role": "user", + "content": [], + } + ] + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + ###################### CONTINUAL MODE ###################### + if self.continual_mode is True and self.cache_mode == "resume": + doc_uuid = f"{task}___{split}___{doc_id}" + if doc_uuid in self.response_cache: + response_text = self.response_cache[doc_uuid] + if response_text: + res.append(response_text) + pbar.update(1) + continue + + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + imgs = [] + for visual in visuals: + if isinstance(visual, str) and os.path.exists(visual): # Assuming visual is a path to a video + visual = self.encode_video(visual) + for img in visual: + imgs.append(img) + else: + visual = self.shrink_image_to_file_size(visual) + img = self.encode_image(visual) + imgs.append(img) + + messages = deepcopy(empty_messages) + + if self.image_token not in contexts: + for img in imgs: + image_block = deepcopy(empty_image_block) + image_block["source"]["data"] = img + messages[0]["content"].append(image_block) + text_block = deepcopy(empty_text_block) + text_block["text"] = contexts + messages[0]["content"].append(text_block) + else: + contexts = contexts.split(self.image_token) + for idx, img in enumerate(imgs): + text_block = deepcopy(empty_text_block) + image_block = deepcopy(empty_image_block) + text_block["text"] = contexts[idx] + messages[0]["content"].append(text_block) + image_block["source"]["data"] = img + messages[0]["content"].append(image_block) + + # If n image tokens are in the contexts + # contexts will be split into n+1 chunks + # Manually add the last chunk into the messages + text_block = deepcopy(empty_text_block) + text_block["text"] = contexts[-1] + messages[0]["content"].append(text_block) + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + if "top_p" not in gen_kwargs: + gen_kwargs["top_p"] = None + if "num_beams" not in gen_kwargs: + gen_kwargs["num_beams"] = 1 + + for attempt in range(5): + try: + message = client.messages.create(model=self.model_version, max_tokens=gen_kwargs["max_new_tokens"], system=self.system_prompt, temperature=gen_kwargs["temperature"], top_p=gen_kwargs["top_p"], messages=messages) + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {str(e)}") + if attempt < 5 - 1: # If we have retries
left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log and return empty + eval_logger.error(f"All 5 attempts failed. Last error message: {str(e)}") + res.append("") + pbar.update(1) + continue + + res.append(message.content[0].text) + pbar.update(1) + + ###################### CONTINUAL MODE ###################### + if self.continual_mode is True: # Cache the response + doc_uuid = f"{task}___{split}___{doc_id}" + self.response_cache[doc_uuid] = response_text + with open(self.response_persistent_file, "w") as f: + json.dump(self.response_cache, f) + + pbar.close() + + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + assert False, "Not supported for claude" diff --git a/lmms_eval/models/from_log.py b/lmms_eval/models/from_log.py new file mode 100644 index 00000000..4c573e0f --- /dev/null +++ b/lmms_eval/models/from_log.py @@ -0,0 +1,117 @@ +import logging +import json +import os +import re + +from datetime import datetime +from typing import List, Tuple +from tqdm import tqdm +from lmms_eval.api.registry import register_model +from lmms_eval.api.model import lmms +from lmms_eval.api.instance import Instance +from accelerate import Accelerator, DistributedType + +eval_logger = logging.getLogger("lmms-eval") + + +@register_model("from_log") +class FromLog(lmms): + def __init__( + self, + logs: str = "logs", + model_name: str = None, + model_args: str = None, + have_limits: bool = False, + **kwargs, + ) -> None: + super().__init__() + + self.logs = {} + + log_folders = logs.split(",") + + def matched_model(_model_args): + if model_name and model_name != _model_args["model"]: + return False + + if model_args: + _model_args_list = model_args.split(",") + + for _model_arg in _model_args_list: + if _model_arg not in _model_args["model_args"]: + return False + + if not have_limits and _model_args["limit"] is not None: + return False + + return True + + for log_folder in log_folders: + for root, dirs, files in os.walk(log_folder): + for file in files: + if file.endswith(".json"): + try: + log_file = os.path.join(root, file) + + with open(log_file, "r") as f: + log_data = json.load(f) + + # check if model is matched + _model_args = log_data["args"] + if not matched_model(_model_args): + raise Exception("Model not matched") + + # load logs + logs = {} + for data in log_data["logs"]: + id = data["doc_id"] + response = data["resps"][0] + logs[id] = response + + task = log_data["model_configs"]["task"] + + pattern = re.compile(r"\d{4}_\d{4}") + + if "time" in log_data: + log_time = log_data["time"] + elif pattern.search(os.path.abspath(log_file)): + log_time = pattern.findall(os.path.abspath(log_file))[-1] + else: + log_time = "unknown" + + if task not in self.logs or (self.logs[task]["time"] == "unknown" or datetime.strptime(log_time, "%m%d_%H%M") > datetime.strptime(self.logs[task]["time"], "%m%d_%H%M")): + self.logs[task] = {"time": log_time, "logs": logs} + + except Exception as e: + pass + + accelerator = Accelerator() + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." 
+ self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + response = self.logs[task]["logs"][doc_id] + res.append(response[0]) + pbar.update(1) + + pbar.close() + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + # TODO + assert False, "not support" diff --git a/lmms_eval/models/fuyu.py b/lmms_eval/models/fuyu.py old mode 100644 new mode 100755 index 7a4844dc..5960d69e --- a/lmms_eval/models/fuyu.py +++ b/lmms_eval/models/fuyu.py @@ -21,8 +21,6 @@ eval_logger = logging.getLogger("lmms-eval") -eval_logger = logging.getLogger("lmms-eval") - @register_model("fuyu") class Fuyu(lmms): @@ -85,7 +83,7 @@ def __init__( self._rank = 0 self._word_size = 1 - '''if accelerator.num_processes > 1: + """if accelerator.num_processes > 1: assert accelerator.distributed_type in [ DistributedType.FSDP, DistributedType.MULTI_GPU, @@ -98,7 +96,7 @@ def __init__( if self.accelerator.is_local_main_process: eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") self._rank = self.accelerator.local_process_index - self._world_size = self.accelerator.num_processes''' + self._world_size = self.accelerator.num_processes""" @property def config(self): @@ -204,7 +202,7 @@ def _collate(x): # generation_output = self.model.generate( # **model_inputs, temperature=gen_kwargs["temperature"], max_new_tokens=gen_kwargs["max_new_tokens"], top_p=gen_kwargs["top_p"], num_beams=gen_kwargs["num_beams"], pad_token_id=self.tokenizer.eos_token_id # ) - generation_output = self.model.generate(**model_inputs, max_new_tokens=gen_kwargs["max_new_tokens"]) + generation_output = self.model.generate(**model_inputs, max_new_tokens=gen_kwargs["max_new_tokens"], pad_token_id=self.tokenizer.eos_token_id) generation_texts = self.processor.batch_decode(generation_output, skip_special_tokens=True) response = [gen_text.split("\x04")[1].strip(" ").strip("\n") for gen_text in generation_texts] res.extend(response) diff --git a/lmms_eval/models/gemini_api.py b/lmms_eval/models/gemini_api.py new file mode 100644 index 00000000..0b2be05e --- /dev/null +++ b/lmms_eval/models/gemini_api.py @@ -0,0 +1,185 @@ +import io +import os +import time +import logging +import json + +from PIL import Image +from typing import List, Tuple +from tqdm import tqdm +from lmms_eval.api.registry import register_model +from lmms_eval.api.model import lmms +from lmms_eval.api.instance import Instance +from accelerate import Accelerator, DistributedType + +eval_logger = logging.getLogger("lmms-eval") + +try: + import google.generativeai as genai + from google.generativeai.types import HarmCategory, HarmBlockThreshold + + NUM_SECONDS_TO_SLEEP = 30 + GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY") + genai.configure(api_key=GOOGLE_API_KEY) + +except Exception as e: + eval_logger.error(f"Error importing generativeai: {str(e)}") + genai = None + + 
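# Editorial sketch (illustrative, not part of the patch): the on-disk response cache
# ("continual mode") shared by the Claude and Gemini wrappers in this diff. Cache keys are
# "{task}___{split}___{doc_id}" and the cache file is
# "<response_persistent_folder>/<model_version>_response.json"; the helper name below is
# an assumption made for illustration only.
import json
import os


def load_response_cache(response_persistent_folder: str, model_version: str) -> tuple:
    """Return (cache, mode): resume from a previous run's JSON file if it exists, else start fresh."""
    path = os.path.join(response_persistent_folder, f"{model_version}_response.json")
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f), "resume"  # mirrors self.cache_mode = "resume"
    return {}, "start"                     # mirrors self.cache_mode = "start"


# Example usage: cache, mode = load_response_cache("./persist", "gemini-1.5-flash-latest")
# A hit is looked up as cache.get(f"{task}___{split}___{doc_id}") before calling the API.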
+@register_model("gemini_api") +class GeminiAPI(lmms): + def __init__( + self, + model_version: str = "gemini-1.5-flash-latest", + modality: str = "image", + timeout: int = 120, + continual_mode: bool = False, + response_persistent_folder: str = None, # We will cache the Gemini API response in this path and use it for future requests + **kwargs, + ) -> None: + super().__init__() + self.model_version = model_version + self.timeout = timeout + self.model = genai.GenerativeModel(model_version) + self.continual_mode = continual_mode + if self.continual_mode and response_persistent_folder is None: + raise ValueError("Continual mode requires a persistent path for the response. We will cache the Gemini API response in this path and use it for future requests. Please provide a valid path.") + self.response_persistent_folder = response_persistent_folder + self.response_persistent_file = os.path.join(self.response_persistent_folder, f"{self.model_version}_response.json") + + if os.path.exists(self.response_persistent_file): + with open(self.response_persistent_file, "r") as f: + self.response_cache = json.load(f) + self.cache_mode = "resume" + else: + self.response_cache = {} + self.cache_mode = "start" + + accelerator = Accelerator() + if accelerator.num_processes > 1: + assert self.continual_mode is False, "Continual mode is not supported with distributed inference." + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device + + self.modality = modality + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def get_image_size(self, image): + # Create a BytesIO object to store the image bytes + img_byte_array = io.BytesIO() + + # Save the image to the BytesIO object + image.save(img_byte_array, format="PNG") + + # Get the size of the BytesIO object + img_size = img_byte_array.tell() + + return img_size + + def encode_video(self, video_path): + uploaded_obj = genai.upload_file(path=video_path) + time.sleep(5) + return uploaded_obj + + def convert_video(self, images): + for idx, img in enumerate(images): + if self.modality == "video" and isinstance(img, str): + try: + images[idx] = self.encode_video(img) + except Exception as e: + eval_logger.error(f"Error converting video: {str(e)}") + return images + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + def get_uuid(task, split, doc_id): + return f"{task}___{split}___{doc_id}" + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + if self.continual_mode is True and self.cache_mode == "resume": + doc_uuid = get_uuid(task, split, doc_id) + if doc_uuid in self.response_cache: + content = self.response_cache[doc_uuid] + if content: + res.append(content) + pbar.update(1) + continue + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if 
"temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + + config = genai.GenerationConfig( + max_output_tokens=gen_kwargs["max_new_tokens"], + temperature=gen_kwargs["temperature"], + ) + + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + visuals = self.convert_video(visuals) + + message = [contexts] + visuals + + for attempt in range(5): + try: + content = self.model.generate_content( + message, + generation_config=config, + safety_settings={ + HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE, + HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE, + HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE, + HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE, + }, + ) + content = content.text + break + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {str(e)}") + if isinstance(e, ValueError): + try: + eval_logger.info(f"Prompt feed_back: {content.prompt_feedback}") + content = "" + break + except Exception: + pass + if attempt < 5 - 1: # If we have retries left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log and return empty + eval_logger.error(f"All 5 attempts failed. Last error message: {str(e)}") + content = "" + res.append(content) + pbar.update(1) + + if self.continual_mode is True: # Cache the response + doc_uuid = get_uuid(task, split, doc_id) + self.response_cache[doc_uuid] = content + with open(self.response_persistent_file, "w") as f: + json.dump(self.response_cache, f) + + pbar.close() + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + # TODO + assert False, "Gemini API not support" diff --git a/lmms_eval/models/gpt4v.py b/lmms_eval/models/gpt4v.py old mode 100644 new mode 100755 index 0194f8ed..d28c1b00 --- a/lmms_eval/models/gpt4v.py +++ b/lmms_eval/models/gpt4v.py @@ -1,5 +1,6 @@ from io import BytesIO from copy import deepcopy +import numpy as np import os import base64 from typing import List, Tuple @@ -13,10 +14,18 @@ from lmms_eval.api.registry import register_model from lmms_eval import utils +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState + +try: + from decord import VideoReader, cpu +except ImportError: + pass + from PIL import Image API_TYPE = os.getenv("API_TYPE", "openai") -NUM_SECONDS_TO_SLEEP = 5 +NUM_SECONDS_TO_SLEEP = 30 eval_logger = logging.getLogger("lmms-eval") if API_TYPE == "openai": @@ -40,6 +49,9 @@ class GPT4V(lmms): def __init__( self, model_version: str = "gpt-4-vision-preview", + modality: str = "video", + max_frames_for_video: int = 10, + timeout: int = 120, **kwargs, ) -> None: super().__init__() @@ -47,7 +59,26 @@ def __init__( # and split the text and image # Here we just use the same token as llava for convenient self.model_version = model_version + self.modality = modality + self.max_frames_for_video = max_frames_for_video self.image_token = "" + self.timeout = timeout + + accelerator = Accelerator() + # assert self.batch_size_per_gpu == 1, "Llava currently does not support batched generation. See https://github.com/haotian-liu/LLaVA/issues/754. HF Llava also has this issue." + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. 
Only DDP and FSDP are supported." + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device # Function to encode the image def encode_image(self, image: Image): @@ -57,6 +88,25 @@ def encode_image(self, image: Image): base64_str = base64.b64encode(byte_data).decode("utf-8") return base64_str + # Function to encode the video + def encode_video(self, video_path, for_get_frames_num): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, for_get_frames_num, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + frames = vr.get_batch(frame_idx).asnumpy() + + base64_frames = [] + for frame in frames: + img = Image.fromarray(frame) + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + base64_frames.append(base64_str) + + return base64_frames + def flatten(self, input): new_list = [] for i in input: @@ -70,12 +120,17 @@ def generate_until(self, requests) -> List[str]: for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: # encode, pad, and truncate contexts for this batch visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] visuals = self.flatten(visuals) - imgs = [] + imgs = [] # multiple images or frames for video for visual in visuals: - img = self.encode_image(visual) - imgs.append(img) + if self.modality == "image": + img = self.encode_image(visual) + imgs.append(img) + elif self.modality == "video": + frames = self.encode_video(visual, self.max_frames_for_video) + imgs.extend(frames) payload = {"model": self.model_version, "messages": []} response_json = {"role": "user", "content": []} @@ -107,12 +162,12 @@ def generate_until(self, requests) -> List[str]: if "num_beams" not in gen_kwargs: gen_kwargs["num_beams"] = 1 - # payload["max_tokens"] = gen_kwargs["max_new_tokens"] - # payload["temperature"] = gen_kwargs["temperature"] + payload["max_tokens"] = gen_kwargs["max_new_tokens"] + payload["temperature"] = gen_kwargs["temperature"] for attempt in range(5): try: - response = url_requests.post(API_URL, headers=headers, json=payload, timeout=20) + response = url_requests.post(API_URL, headers=headers, json=payload, timeout=self.timeout) response_data = response.json() content = response_data["choices"][0]["message"]["content"].strip() @@ -124,9 +179,11 @@ def generate_until(self, requests) -> List[str]: time.sleep(NUM_SECONDS_TO_SLEEP) else: # If this was the last attempt, log and return empty eval_logger.error(f"All 5 attempts failed.
Last error message: {str(e)}") + eval_logger.error(f"Response: {response}") content = "" res.append(content) pbar.update(1) + pbar.close() return res def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: diff --git a/lmms_eval/models/idefics2.py b/lmms_eval/models/idefics2.py index 7419472b..9e274c8a 100644 --- a/lmms_eval/models/idefics2.py +++ b/lmms_eval/models/idefics2.py @@ -17,12 +17,14 @@ eval_logger = logging.getLogger("lmms-eval") DEFAULT_IMAGE_TOKEN = "" -try: +try: import flash_attn + best_fit_attn_implementation = "flash_attention_2" except ImportError: best_fit_attn_implementation = "eager" + @register_model("idefics2") class Idefics2(lmms): """ @@ -50,7 +52,7 @@ def __init__( attn_implementation: Optional[str] = best_fit_attn_implementation, device_map: str = "", use_cache: bool = True, - do_image_splitting: bool =False, + do_image_splitting: bool = False, **kwargs, ) -> None: super().__init__() @@ -194,9 +196,14 @@ def _collate(x): # we assume all gen kwargs in the batch are the same # this is safe to assume because the `grouper` object ensures it. gen_kwargs = all_gen_kwargs[0] - # + # until = gen_kwargs.pop("until", None) - image_aspect_ratio = gen_kwargs.pop("image_aspect_ratio", None) + image_aspect_ratio = gen_kwargs.pop("image_aspect_ratio", None) + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + prompts = [] for context, visual in zip(contexts, visuals): content = [] @@ -212,9 +219,9 @@ def _collate(x): output_ids = self.model.generate(**inputs, **gen_kwargs) # only retain the generated text for output_id, input_id in zip(output_ids, inputs["input_ids"]): - generated_id = output_id[len(input_id):] + generated_id = output_id[len(input_id) :] generated_text = self.tokenizer.decode(generated_id, skip_special_tokens=True) - + res.append(generated_text) pbar.update(1) # reorder this group of results back to original unsorted form diff --git a/lmms_eval/models/instructblip.py b/lmms_eval/models/instructblip.py old mode 100644 new mode 100755 index 2f065ffe..3ca068ed --- a/lmms_eval/models/instructblip.py +++ b/lmms_eval/models/instructblip.py @@ -10,6 +10,7 @@ from accelerate import Accelerator, DistributedType from accelerate.state import AcceleratorState from typing import List, Optional, Union, Tuple +import transformers from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration from lmms_eval.utils import stop_sequences_criteria @@ -20,6 +21,7 @@ warnings.filterwarnings("ignore") eval_logger = logging.getLogger("lmms-eval") +transformers.logging.set_verbosity_error() @register_model("instructblip") diff --git a/lmms_eval/models/internvl.py b/lmms_eval/models/internvl.py new file mode 100644 index 00000000..d808081a --- /dev/null +++ b/lmms_eval/models/internvl.py @@ -0,0 +1,485 @@ +import logging +import os +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from typing import List, Optional, Union, Tuple +import torch +from tqdm import tqdm +import numpy as np +import math +from datetime import timedelta +from transformers import AutoConfig +from huggingface_hub import snapshot_download +import requests + +from lmms_eval import utils +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval.utils import stop_sequences_criteria +from PIL import Image + 
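# Editorial sketch (illustrative, not part of the patch): the uniform frame sampling used by
# the encode_video helpers added above (batch_gpt4 / gpt4v / claude). np.linspace spreads the
# requested number of frame indices evenly across the clip; the numbers below are made up.
import numpy as np

total_frame_num = 300    # hypothetical clip length reported by decord's VideoReader
for_get_frames_num = 10  # e.g. max_frames_for_video
frame_idx = np.linspace(0, total_frame_num - 1, for_get_frames_num, dtype=int).tolist()
assert frame_idx == [0, 33, 66, 99, 132, 166, 199, 232, 265, 299]
# Each selected frame is then converted to a PIL image and base64-encoded from PNG bytes.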
+import subprocess +from pathlib import Path + +wd = Path(__file__).parent.parent.parent.resolve() +import sys + +sys.path.append(os.path.join(str(wd), "InternVL", "internvl_chat")) +eval_logger = logging.getLogger("lmms-eval") + +if not hasattr(eval_logger, "internvl_warning_logged"): + eval_logger.internvl_warning_logged = False + +try: + from internvl.model.internlm2.modeling_internlm2 import InternLM2ForCausalLM + from internvl.model.internvl_chat.configuration_internvl_chat import InternVLChatConfig + from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel + from internvl.model.internvl_chat import InternVLChatModel + from internvl.train.dataset import build_transform, dynamic_preprocess +except ImportError: + eval_logger.debug("InternVL is not installed. Please install InternVL to use this model.") + if not eval_logger.internvl_warning_logged: + eval_logger.debug("InternVL is not installed. Please install InternVL to use this model.") + eval_logger.internvl_warning_logged = True + +import warnings +from typing import Any, List, Optional, Tuple, Union + +import torch.utils.checkpoint + +from peft import LoraConfig, get_peft_model +from torch import nn +from torch.nn import CrossEntropyLoss +from transformers import AutoModel, GenerationConfig, LlamaForCausalLM, LlamaTokenizer +from transformers.modeling_outputs import CausalLMOutputWithPast +from transformers.modeling_utils import PreTrainedModel +from transformers import AutoTokenizer +import re +from huggingface_hub import snapshot_download + + +@register_model("internvl") +class InternVLChat(lmms): + # config_class = InternVLChatConfig + main_input_name = "pixel_values" + _no_split_modules = ["InternVisionEncoderLayer", "LlamaDecoderLayer"] + + """ + 0. Install lmms-eval + cd lmms-eval + pip install -e . + + How to Install InternVL: + 1. Clone the InternVL repository: + git clone https://github.com/OpenGVLab/InternVL.git + + 2. Install the requirements: + pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118 + + 3. Install flash-attn==2.3.6: + pip install flash-attn==2.3.6 --no-build-isolation + """ + + """ + How to download the pretrained model: + 1. Download the pretrained model from hugginface: + cd pretrained/ + # pip install -U huggingface_hub + huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL-Chat-V1-5 --local-dir InternVL-Chat-V1-5 + + 2. 
the pretrained model should be in the following directory: + pretrained + └── InternVL-Chat-V1-5 + """ + + # + # The above steps can be optional, I add snapshot download, so now can just use hf repo_id + # model_args pretrained=OpenGVLab/InternVL-Chat-V1-5 + # + + """ + InternVL-Chat-V1-5 Model for OpenGVLab https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py + Example usage: + + accelerate launch --num_processes=8 --main_process_port 12345 -m lmms_eval \ + --model internvl \ + --model_args pretrained=OpenGVLab/InternVL-Chat-V1-5 \ + --tasks llava_wilder_small \ + --batch_size 1 \ + --output_path ./logs/ \ + --log_samples + """ + + def __init__( + self, + config=None, + pretrained: str = "OpenGVLab/InternVL-Chat-V1-5", + truncation: Optional[bool] = True, + device: Optional[str] = "cuda:0", + dtype: Optional[Union[str, torch.dtype]] = "auto", + batch_size: Optional[Union[int, str]] = 1, + trust_remote_code: Optional[bool] = False, + revision=None, + device_map="cuda:0", + conv_template="vicuna_v1", + use_cache=True, + truncate_context=False, # whether to truncate the context in generation, set it False for LLaVA-1.6 + customized_config=None, # ends in json + dynamic=True, + load_in_8bit=False, + vision_model=None, + language_model=None, + max_num=12, + **kwargs, + ) -> None: + super().__init__() + + assert kwargs == {}, f"Unexpected kwargs: {kwargs}" + + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + + self.dynamic = dynamic # dynamic image_size + self.max_num = max_num + if accelerator.is_main_process: + cache_dir = snapshot_download(repo_id=pretrained, cache_dir="cache_dir", local_dir="cache_dir", local_dir_use_symlinks=False) + accelerator.wait_for_everyone() + # So what I did is that I let main process to download the repo, and then + # other process can just simply read from this repo + cache_dir = snapshot_download(repo_id=pretrained, cache_dir="cache_dir", local_dir="cache_dir", local_dir_use_symlinks=False) + config = InternVLChatConfig.from_pretrained(cache_dir) + tokenizer = AutoTokenizer.from_pretrained(cache_dir, trust_remote_code=True, use_fast=False) + model = InternVLChatModel.from_pretrained(cache_dir, low_cpu_mem_usage=True, config=config, torch_dtype=torch.bfloat16, load_in_8bit=load_in_8bit).eval() + if not load_in_8bit: + model = model.cuda() + # self.model=model + # self.device=self._device + self._tokenizer = tokenizer + # self.tokenizer=tokenizer + self._model = model + self._config = self._model.config + self.use_thumbnail = self.model.config.use_thumbnail + self.model.eval() + self.model.tie_weights() + self.truncation = truncation + self.batch_size_per_gpu = int(batch_size) + self.conv_template = conv_template + self.use_cache = use_cache + self.truncate_context = truncate_context + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type 
provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. + if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. Make sure you run `accelerate config` and set zero stage to 0") + + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + elif accelerator.num_processes == 1 and device_map == "auto": + eval_logger.info(f"Using {accelerator.num_processes} devices with tensor parallelism") + self._rank = 0 + self._word_size = 1 + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + # from internvl model + + self.image_size = config.force_image_size or config.vision_config.image_size + + def wrap_backbone_lora(self, r=128, lora_alpha=256, lora_dropout=0.05): + lora_config = LoraConfig( + r=r, + target_modules=["attn.qkv", "attn.proj", "mlp.fc1", "mlp.fc2"], + lora_alpha=lora_alpha, + lora_dropout=lora_dropout, + ) + self.vision_model = get_peft_model(self.vision_model, lora_config) + self.vision_model.print_trainable_parameters() + + def wrap_llm_lora(self, r=128, lora_alpha=256, lora_dropout=0.05): + lora_config = LoraConfig( + r=r, target_modules=["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj", "mlp.gate_proj", "mlp.down_proj", "mlp.up_proj"], lora_alpha=lora_alpha, lora_dropout=lora_dropout, task_type="CAUSAL_LM" + ) + self.language_model = get_peft_model(self.language_model, lora_config) + self.language_model.enable_input_require_grads() + self.language_model.print_trainable_parameters() + + def pixel_shuffle(self, x, scale_factor=0.5): + n, w, h, c = x.size() + # N, W, H, C --> N, W, H * scale, C // scale + x = x.view(n, w, int(h * scale_factor), int(c / scale_factor)) + # N, W, H * scale, C // scale --> N, H * scale, W, C // scale + x = x.permute(0, 2, 1, 3).contiguous() + # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2) + x = x.view(n, int(h * scale_factor), int(w * scale_factor), int(c / (scale_factor * scale_factor))) + if self.ps_version == "v1": + warnings.warn("In ps_version 'v1', the height and width have not been swapped back, " "which results in a transposed image.") + else: + x = x.permute(0, 2, 1, 3).contiguous() + return x + + def noised_embed(self, vit_embeds, noise_alpha=5): + dims = torch.tensor(vit_embeds.size(1) * vit_embeds.size(2)) + mag_norm = noise_alpha / torch.sqrt(dims) + noise = torch.zeros_like(vit_embeds).uniform_(-mag_norm, mag_norm) + return 
vit_embeds + noise + + def extract_feature(self, pixel_values): + if self.select_layer == -1: + vit_embeds = self.vision_model(pixel_values=pixel_values, output_hidden_states=False, return_dict=True).last_hidden_state + else: + vit_embeds = self.vision_model(pixel_values=pixel_values, output_hidden_states=True, return_dict=True).hidden_states[self.select_layer] + vit_embeds = vit_embeds[:, 1:, :] + + if self.training and self.neftune_alpha is not None: + vit_embeds = self.noised_embed(vit_embeds, self.neftune_alpha) + + h = w = int(vit_embeds.shape[1] ** 0.5) + vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1) + vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio) + vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, vit_embeds.shape[-1]) + vit_embeds = self.mlp1(vit_embeds) # .to(pixel_values.device) + return vit_embeds + + def multi_image_chat(self, tokenizer, pixel_values, image_counts, question, generation_config, history=None, return_history=False, IMG_START_TOKEN="", IMG_END_TOKEN="", IMG_CONTEXT_TOKEN=""): + img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN) + self.img_context_token_id = img_context_token_id + if tokenizer.convert_tokens_to_ids("<|im_end|>") != 0: + eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>") # 92542, InternLM2 + else: + eos_token_id = tokenizer.eos_token_id + + from internvl.conversation import get_conv_template + + template = get_conv_template(self.template) + + if history is None: + history = [] + image_tokens = "" + image_bs = pixel_values.shape[0] + # print(f"dynamic ViT batch size: {image_bs}, image_counts: {image_counts}") + for idx, image_count in enumerate(image_counts): + image_tokens += f" (图{idx+1}):" + IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_count + IMG_END_TOKEN + question = image_tokens + "\n" + question + else: + for old_question, old_answer in history: + template.append_message(template.roles[0], old_question) + template.append_message(template.roles[1], old_answer) + template.append_message(template.roles[0], question) + template.append_message(template.roles[1], None) + query = template.get_prompt() + model_inputs = tokenizer(query, return_tensors="pt") + input_ids = model_inputs["input_ids"].cuda() + attention_mask = model_inputs["attention_mask"].cuda() + generation_config["eos_token_id"] = eos_token_id + + generation_output = self.generate(pixel_values=pixel_values, input_ids=input_ids, attention_mask=attention_mask, **generation_config) + response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0] + response = response.split("<|im_end|>")[0].strip() # for InternLM2 + history.append((question, response)) + if return_history: + return response, history + else: + query_to_print = query.replace(image_tokens, "") + # print(query_to_print, response) + return response + return response + + @property + def tokenizer(self): + return self._tokenizer + + @property + def model(self): + # returns the model, unwrapping it if using Accelerate + if hasattr(self, "accelerator"): + return self.accelerator.unwrap_model(self._model) + else: + return self._model + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size + + def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None) -> List[int]: + """ """ + add_special_tokens = 
False if add_special_tokens is None else add_special_tokens + encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens) + # left-truncate the encoded context to be at most `left_truncate_len` tokens long + if left_truncate_len: + encoding = encoding[-left_truncate_len:] + return encoding + + def tok_decode(self, tokens): + try: + return self.tokenizer.decode(tokens) + except: + return self.tokenizer.decode([tokens]) + + def post_processing(self, response): + response = response.replace("\n", "").replace("不是", "No").replace("是", "Yes").replace("否", "No") + response = response.lower().replace("true", "yes").replace("false", "no") + pattern = re.compile(r"[\u4e00-\u9fa5]") + response = re.sub(pattern, "", response) + return response + + @torch.no_grad() + def generate( + self, + pixel_values: Optional[torch.FloatTensor] = None, + input_ids: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + visual_features: Optional[torch.FloatTensor] = None, + generation_config: Optional[GenerationConfig] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + **generate_kwargs, + ) -> torch.LongTensor: + assert self.img_context_token_id is not None + if pixel_values is not None: + if visual_features is not None: + vit_embeds = visual_features + else: + vit_embeds = self.extract_feature(pixel_values) + + input_embeds = self.language_model.get_input_embeddings()(input_ids) + B, N, C = input_embeds.shape + input_embeds = input_embeds.reshape(B * N, C) + + input_ids = input_ids.reshape(B * N) + selected = input_ids == self.img_context_token_id + assert selected.sum() != 0 + input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device) + + input_embeds = input_embeds.reshape(B, N, C) + else: + input_embeds = self.language_model.get_input_embeddings()(input_ids) + + outputs = self.language_model.generate( + inputs_embeds=input_embeds, + attention_mask=attention_mask, + generation_config=generation_config, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + use_cache=True, + **generate_kwargs, + ) + + return outputs + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def load_image(self, flattened_visuals, input_size=224): + assert flattened_visuals[0].mode == "RGB" + image = flattened_visuals[0].convert("RGB") + transform = build_transform(is_train=False, input_size=input_size) + if self.dynamic: + images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=self.use_thumbnail, max_num=self.max_num) + else: + images = [image] + pixel_values = [transform(image) for image in images] + pixel_values = torch.stack(pixel_values) + return pixel_values + + def generate_until(self, requests: List[Instance]) -> List[str]: + res = [] + + def _collate(x): + # the negative sign on len(toks) sorts descending - this has a few advantages: + # - time estimates will always be over not underestimates, which is more useful for planning + # - to know the size of a batch when going through the list, you know the first one is always the batch + # padded context length. this is useful to simplify the batching logic and more importantly to make + # automatic adaptive batches much much easier to implement + # - any OOMs will happen right away rather than near the end + toks = self.tok_encode(x[0]) + return -len(toks), x[0] + + # we group requests by their generation_kwargs, + # so that we don't try to execute e.g. 
greedy sampling and temp=0.8 sampling + # in the same batch. + re_ords = utils.Collator([reg.args for reg in requests], _collate, grouping=True) + chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None) + num_iters = len(requests) // self.batch_size if len(requests) % self.batch_size == 0 else len(requests) // self.batch_size + 1 + pbar = tqdm(total=num_iters, disable=(self.rank != 0), desc="Model Responding") + for chunk in chunks: + contexts, all_gen_kwargs, doc_to_visual, doc_id, task, split = zip(*chunk) + task = task[0] + split = split[0] + batched_visuals = [doc_to_visual[0](self.task_dict[task][split][ids]) for ids in doc_id] # [B, N] + flattened_visuals = self.flatten(batched_visuals) + pixel_values = self.load_image(flattened_visuals, self.image_size).cuda().to(torch.bfloat16) + gen_kwargs = all_gen_kwargs[0] + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + if "top_p" not in gen_kwargs: + gen_kwargs["top_p"] = None + if "num_beams" not in gen_kwargs: + gen_kwargs["num_beams"] = 1 + + generation_config = dict( + do_sample=False, + top_k=50, + top_p=gen_kwargs["top_p"], + num_beams=gen_kwargs["num_beams"], + max_new_tokens=gen_kwargs["max_new_tokens"], + eos_token_id=self.tokenizer.eos_token_id, + ) + question = contexts[0] + response = self.model.chat(tokenizer=self.tokenizer, pixel_values=pixel_values, question=question, generation_config=generation_config) + # TODO(choiszt) try batch_chat for multiple inputs + response = self.post_processing(response) + res.append(response) + self.cache_hook.add_partial("generate_until", (question, gen_kwargs), response) + pbar.update(1) + res = re_ords.get_original(res) + return res + # print(chunk) + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + pass diff --git a/lmms_eval/models/llama_vid.py b/lmms_eval/models/llama_vid.py new file mode 100644 index 00000000..69627fe8 --- /dev/null +++ b/lmms_eval/models/llama_vid.py @@ -0,0 +1,272 @@ +import logging +import os +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from typing import List, Optional, Union, Tuple +import torch +from tqdm import tqdm +from decord import VideoReader, cpu +import numpy as np +import math +from datetime import timedelta +from transformers import AutoConfig +from huggingface_hub import snapshot_download +import requests + +from lmms_eval import utils +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval.utils import stop_sequences_criteria +from lmms_eval.models.model_utils.load_video import read_video_pyav + +import subprocess + +eval_logger = logging.getLogger("lmms-eval") + +try: + from llamavid.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN + from llamavid.conversation import conv_templates, SeparatorStyle + from llamavid.model.builder import load_pretrained_model + from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria +except ImportError: + eval_logger.debug("LLaMA-Video is not installed. 
Please install LLaMA-Video to use this model.") + + +@register_model("llama_vid") +class LLaMAVid(lmms): + def __init__( + self, + pretrained: str = "YanweiLi/llama-vid-7b-full-224-video-fps-1", + truncation: Optional[bool] = True, + device: Optional[str] = "cuda:0", + dtype: Optional[Union[str, torch.dtype]] = "auto", + batch_size: Optional[Union[int, str]] = 1, + trust_remote_code: Optional[bool] = False, + revision=None, + attn_implementation=( + "sdpa" if torch.__version__ > "2.1.2" else "eager" + ), # inference implementation for attention, can be "sdpa", "eager", "flash_attention_2". Seems FA2 is not effective during inference: https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453/5 + device_map="cuda:0", + conv_template="vicuna_v1", + use_cache=True, + truncate_context=False, + num_frames: int = 100, + **kwargs, + ) -> None: + super().__init__() + + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + + self.pretrained = pretrained + self.model_path = snapshot_download(self.pretrained) + self.model_name = get_model_name_from_path(pretrained) + self.num_frames = num_frames + if not os.path.exists("./model_zoo/LAVIS/eva_vit_g.pth") and accelerator.is_main_process: + eval_logger.info("\n\n Eva Encoder is not found for LLaMA-VID. Download automatically to the folder ./model_zoo/LAVIS") + cache_path = "model_zoo/LAVIS" + os.makedirs(cache_path, exist_ok=True) + subprocess.run(["wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth -O ./model_zoo/LAVIS/eva_vit_g.pth"], shell=True) + + accelerator.wait_for_everyone() + self._tokenizer, self._model, self.image_processor, self._max_length = load_pretrained_model( + self.model_path, + None, + self.model_name, + device_map=self.device_map, + ) + + self._config = self._model.config + self.model.eval() + self.model.tie_weights() + self.truncation = truncation + self.batch_size_per_gpu = int(batch_size) + self.conv_template = conv_template + self.use_cache = use_cache + self.truncate_context = truncate_context + # assert self.batch_size_per_gpu == 1, "Llava currently does not support batched generation. See https://github.com/haotian-liu/LLaVA/issues/754. HF Llava also has this issue." + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. 
+ if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. Make sure you run `accelerate config` and set zero stage to 0") + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + elif accelerator.num_processes == 1 and device_map == "auto": + eval_logger.info(f"Using {accelerator.num_processes} devices with tensor parallelism") + self._rank = 0 + self._word_size = 1 + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + def download_file(self, url, folder_path): + # Create the folder if it doesn't exist + if not os.path.exists(folder_path): + os.makedirs(folder_path) + + # Extract filename from URL + filename = url.split("/")[-1] + + # Define path to save the file + file_path = os.path.join(folder_path, filename) + + # Send a GET request to the URL + response = requests.get(url) + + # Check if request was successful (status code 200) + if response.status_code == 200: + # Save the file to the specified folder + with open(file_path, "wb") as f: + f.write(response.content) + print(f"File downloaded successfully to {file_path}") + else: + print(f"Failed to download file. Status code: {response.status_code}") + + @property + def config(self): + # return the associated transformers.AutoConfig for the given pretrained model. 
+ return self._config + + @property + def tokenizer(self): + return self._tokenizer + + @property + def model(self): + # returns the model, unwrapping it if using Accelerate + if hasattr(self, "accelerator"): + return self.accelerator.unwrap_model(self._model) + else: + return self._model + + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + + @property + def max_length(self): + return self._max_length + + def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None) -> List[int]: + """ """ + add_special_tokens = False if add_special_tokens is None else add_special_tokens + encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens) + # left-truncate the encoded context to be at most `left_truncate_len` tokens long + if left_truncate_len: + encoding = encoding[-left_truncate_len:] + return encoding + + def tok_decode(self, tokens): + return self.tokenizer.decode(tokens) + + def load_video(self, video_path): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + fps = round(vr.get_avg_fps()) + frame_idx = [i for i in range(0, len(vr), fps)] + spare_frames = vr.get_batch(frame_idx).asnumpy() + return spare_frames + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + videos = [] + for visual in visuals: + video = read_video_pyav(visual, num_frm=self.num_frames) + video = self.image_processor.preprocess(video, return_tensors="pt")["pixel_values"].half().cuda() + video = [video] + videos += video + qs = contexts + if self.model.config.mm_use_im_start_end: + qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + qs + else: + qs = DEFAULT_IMAGE_TOKEN + "\n" + qs + + conv = conv_templates[self.conv_template].copy() + conv.append_message(conv.roles[0], qs) + conv.append_message(conv.roles[1], None) + prompt = conv.get_prompt() + + input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda() + + stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 + keywords = [stop_str] + stopping_criteria = KeywordsStoppingCriteria(keywords, self.tokenizer, input_ids) + + cur_prompt = contexts + with torch.inference_mode(): + self.model.update_prompt([[cur_prompt]]) + output_ids = self.model.generate(input_ids, images=video, do_sample=True, temperature=0.2, max_new_tokens=1024, use_cache=True, stopping_criteria=[stopping_criteria]) + + input_token_len = input_ids.shape[1] + n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item() + if n_diff_input_output > 0: + print(f"[Warning] {n_diff_input_output} output_ids are not the same as the input_ids") + outputs = self.tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0] + outputs = outputs.strip() + if outputs.endswith(stop_str): + outputs = outputs[: -len(stop_str)] + outputs = outputs.strip() + pbar.update(1) + res.append(outputs) + + return res + + def 
loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + return super().loglikelihood(requests) + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size diff --git a/lmms_eval/models/llava.py b/lmms_eval/models/llava.py old mode 100644 new mode 100755 index bd21bd33..b49cf55b --- a/lmms_eval/models/llava.py +++ b/lmms_eval/models/llava.py @@ -16,6 +16,7 @@ from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs from accelerate.state import AcceleratorState from typing import List, Optional, Union, Tuple +from packaging import version import warnings warnings.filterwarnings("ignore") @@ -25,12 +26,16 @@ try: from llava.model.builder import load_pretrained_model from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token - from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX - from llava.conversation import conv_templates, SeparatorStyle -except ImportError: - eval_logger.error("LLaVA is not installed. Please install LLaVA to use this model.") + from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN + from llava.conversation import conv_templates +except Exception as e: + eval_logger.debug("LLaVA is not installed. Please install LLaVA to use this model.\nError: %s" % e) -if torch.__version__ > "2.1.2": +# inference implementation for attention, can be "sdpa", "eager", "flash_attention_2". Seems FA2 is not effective during inference: https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453/5 +# if is_flash_attn_2_available: +# best_fit_attn_implementation = "flash_attention_2" # flash_attn has a bug that says: ERROR Error query and key must have the same dtype in generating + +if version.parse(torch.__version__) >= version.parse("2.1.2"): best_fit_attn_implementation = "sdpa" else: best_fit_attn_implementation = "eager" @@ -46,19 +51,15 @@ def __init__( self, pretrained: str = "liuhaotian/llava-v1.5-7b", truncation: Optional[bool] = True, - device: Optional[str] = "cuda", - dtype: Optional[Union[str, torch.dtype]] = "auto", + device: Optional[str] = "cuda:0", batch_size: Optional[Union[int, str]] = 1, - trust_remote_code: Optional[bool] = False, - revision=None, model_name=None, attn_implementation=best_fit_attn_implementation, - use_flash_attention_2=True, - device_map="auto", + device_map="cuda:0", conv_template="vicuna_v1", use_cache=True, truncate_context=False, # whether to truncate the context in generation, set it False for LLaVA-1.6 - customized_config=None, + customized_config=None, # ends in json **kwargs, ) -> None: super().__init__() @@ -67,32 +68,33 @@ def __init__( accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) - if accelerator.num_processes > 1 and device_map == "": + if accelerator.num_processes > 1: self._device = torch.device(f"cuda:{accelerator.local_process_index}") self.device_map = f"cuda:{accelerator.local_process_index}" - else: + elif accelerator.num_processes == 1 and device_map == "auto": self._device = torch.device(device) self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" - llava_model_args = {} 
- llava_model_args["attn_implementation"] = attn_implementation - if customized_config: + llava_model_args = { + "multimodal": True, + } + if customized_config is not None: llava_model_args["customized_config"] = customized_config if attn_implementation is not None: llava_model_args["attn_implementation"] = attn_implementation if "use_flash_attention_2" in kwargs: llava_model_args["use_flash_attention_2"] = kwargs["use_flash_attention_2"] - model_name = model_name if model_name is not None else get_model_name_from_path(pretrained) try: # Try to load the model with the multimodal argument self._tokenizer, self._model, self._image_processor, self._max_length = load_pretrained_model(pretrained, None, model_name, device_map=self.device_map, **llava_model_args) except TypeError: - # for older versions of LLaVA that don't have multimodal and attn_implementation arguments + # for older versions of LLaVA that don't have multimodal argument llava_model_args.pop("multimodal", None) - llava_model_args.pop("attn_implementation", None) self._tokenizer, self._model, self._image_processor, self._max_length = load_pretrained_model(pretrained, None, model_name, device_map=self.device_map, **llava_model_args) - self._config = self._model.config self.model.eval() self.model.tie_weights() @@ -102,7 +104,7 @@ def __init__( self.use_cache = use_cache self.truncate_context = truncate_context # assert self.batch_size_per_gpu == 1, "Llava currently does not support batched generation. See https://github.com/haotian-liu/LLaVA/issues/754. HF Llava also has this issue." - if accelerator.num_processes > 1 and device_map == "": + if accelerator.num_processes > 1: assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." 
# If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works @@ -194,7 +196,10 @@ def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=Non return encoding def tok_decode(self, tokens): - return self.tokenizer.decode(tokens) + try: + return self.tokenizer.decode(tokens) + except: + return self.tokenizer.decode([tokens]) def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: # TODO @@ -209,6 +214,7 @@ def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: continuation = doc_to_target(self.task_dict[task][split][doc_id]) visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] visuals = self.flatten(visuals) + image_sizes = [[visual.size[0], visual.size[1]] for visual in visuals] if visuals: image = process_images(visuals, self._image_processor, self._config) if type(image) is list: @@ -250,7 +256,7 @@ def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: # Context part no need to calculate for loss labels[0, : contxt_id.shape[1]] = -100 with torch.inference_mode(): - outputs = self.model(input_ids=input_ids, labels=labels, images=image, use_cache=True) + outputs = self.model(input_ids=input_ids, labels=labels, images=image, use_cache=True, image_sizes=image_sizes) loss = outputs["loss"] # loss = torch.exp(loss) logits = outputs["logits"] @@ -294,8 +300,8 @@ def _collate(x): contexts, all_gen_kwargs, doc_to_visual, doc_id, task, split = zip(*chunk) task = task[0] split = split[0] - visuals = [doc_to_visual[0](self.task_dict[task][split][ids]) for ids in doc_id] - visuals = self.flatten(visuals) + batched_visuals = [doc_to_visual[0](self.task_dict[task][split][ids]) for ids in doc_id] # [B, N] + flattened_visuals = self.flatten(batched_visuals) # [B*N] # we assume all gen kwargs in the batch are the same # this is safe to assume because the `grouper` object ensures it. gen_kwargs = all_gen_kwargs[0] @@ -316,8 +322,8 @@ def _collate(x): self._config.image_aspect_ratio = gen_kwargs.pop("image_aspect_ratio") eval_logger.info(f"Setting image aspect ratio: {self._config.image_aspect_ratio}") # encode, pad, and truncate contexts for this batch - if visuals: - image_tensor = process_images(visuals, self._image_processor, self._config) + if flattened_visuals: + image_tensor = process_images(flattened_visuals, self._image_processor, self._config) if type(image_tensor) is list: image_tensor = [_image.to(dtype=torch.float16, device=self.device) for _image in image_tensor] else: @@ -329,7 +335,7 @@ def _collate(x): question_input = [] - for visual, context in zip(visuals, contexts): + for visual, context in zip(batched_visuals, contexts): if image_tensor is not None and len(image_tensor) != 0 and DEFAULT_IMAGE_TOKEN not in context: """ Three senarios: @@ -342,7 +348,6 @@ def _collate(x): question = image_tokens + "\n" + context else: question = context - # This is much safer for llama3, as we now have some object type in it if "llama_3" in self.conv_template: conv = copy.deepcopy(conv_templates[self.conv_template]) @@ -356,7 +361,7 @@ def _collate(x): # The above for loop has bugs. When there is no visuals, e.g. 
pure text, # there will be no for loop execute resulting in an empty question_input (because no visuals) # Scenario 1 won't even be execute - if len(visuals) == 0: + if len(flattened_visuals) == 0: for context in contexts: question = context conv = conv_templates[self.conv_template].copy() @@ -367,7 +372,7 @@ def _collate(x): # input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(self.device) # preconfigure gen_kwargs with defaults - gen_kwargs["image_sizes"] = [visuals[idx].size for idx in range(len(visuals))] + gen_kwargs["image_sizes"] = [flattened_visuals[idx].size for idx in range(len(flattened_visuals))] if "max_new_tokens" not in gen_kwargs: gen_kwargs["max_new_tokens"] = 1024 if "temperature" not in gen_kwargs: @@ -382,7 +387,7 @@ def _collate(x): input_ids = self.pad_sequence(input_ids_list, batch_first=True, padding_value=pad_token_ids).to(self.device) attention_masks = input_ids.ne(pad_token_ids).to(self.device) # These steps are not in LLaVA's original code, but are necessary for generation to work - # TODO: pay attention to this major generation step... + # TODO: attention to this major generation step... try: cont = self.model.generate( input_ids, @@ -399,6 +404,7 @@ def _collate(x): ) text_outputs = self.tokenizer.batch_decode(cont, skip_special_tokens=True) except Exception as e: + raise e eval_logger.error(f"Error {e} in generating") cont = "" text_outputs = [""] diff --git a/lmms_eval/models/llava_sglang.py b/lmms_eval/models/llava_sglang.py index 01f23535..47c67288 100644 --- a/lmms_eval/models/llava_sglang.py +++ b/lmms_eval/models/llava_sglang.py @@ -1,4 +1,5 @@ import torch +import random torch.backends.cuda.matmul.allow_tf32 = True @@ -11,7 +12,6 @@ from lmms_eval.api.model import lmms from lmms_eval.api.registry import register_model -from accelerate import Accelerator, InitProcessGroupKwargs from typing import List, Optional, Union, Tuple import warnings @@ -25,7 +25,7 @@ import sglang as sgl from sglang.lang.chat_template import get_chat_template except ImportError: - eval_logger.error("SGLang is not installed. If you want to use llava_sglang, please install it using pip install 'sglang[all]' ") + eval_logger.debug("SGLang is not installed. If you want to use llava_sglang, please install it using pip install 'sglang[all]' ") if torch.__version__ > "2.1.2": best_fit_attn_implementation = "sdpa" @@ -53,11 +53,11 @@ def __init__( self.tokenizer = tokenizer self.tp_size = tp_size self.conv_template = conv_template - torch.multiprocessing.set_start_method("spawn") + # torch.multiprocessing.set_start_method("spawn") - accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) - accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) - assert accelerator.num_processes == 1, "Llava-sglang does not support multi-processes yet (it does support tensor parallelism)." + # accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + # accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + # assert accelerator.num_processes == 1, "Llava-sglang does not support multi-processes yet (it does support tensor parallelism)." 
self._rank = 0 self._world_size = 1 self.parallel = parallel @@ -66,8 +66,8 @@ def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: raise NotImplementedError("Llava-sglang does not support loglikelihood evaluation yet") def generate_until(self, requests: List[Instance]) -> List[str]: - - runtime = sgl.Runtime(model_path=self.pretrained, tokenizer_path=self.tokenizer, tp_size=self.tp_size) + torch.multiprocessing.set_start_method("spawn", force=True) + runtime = sgl.Runtime(model_path=self.pretrained, tokenizer_path=self.tokenizer, tp_size=self.tp_size, port=random.randint(10000, 50000)) runtime.endpoint.chat_template = get_chat_template(self.conv_template) sgl.set_default_backend(runtime) @@ -109,9 +109,6 @@ def _collate(x): gen_kwargs["top_p"] = 1.0 if "num_beams" not in gen_kwargs: gen_kwargs["num_beams"] = 1 - if gen_kwargs["top_p"] == 0.0: - gen_kwargs["top_p"] = 1.0 - gen_kwargs["temperature"] = 0.0 assert gen_kwargs["num_beams"] == 1 def save_image_to_temp_file(image): diff --git a/lmms_eval/models/llava_vid.py b/lmms_eval/models/llava_vid.py new file mode 100755 index 00000000..abd42c36 --- /dev/null +++ b/lmms_eval/models/llava_vid.py @@ -0,0 +1,419 @@ +import logging +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from typing import List, Optional, Union, Tuple +import torch +from tqdm import tqdm +from decord import VideoReader, cpu +import numpy as np +import math +from datetime import timedelta +from transformers import AutoConfig +import copy + +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval.models.model_utils.load_video import read_video_pyav + +eval_logger = logging.getLogger("lmms-eval") +import sys + +sys.path.append("llava-video") +try: + from llavavid.model.language_model.llava_llama import LlavaConfig + + # from llavavid.model.language_model.llava_qwen import LlavaQwenConfig + from llavavid.model.builder import load_pretrained_model + from llavavid.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria + from llavavid.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN + from llavavid.conversation import conv_templates, SeparatorStyle + + # AutoConfig.register("llava_qwen", LlavaQwenConfig) + AutoConfig.register("llava_llama", LlavaConfig) + +except ImportError: + eval_logger.debug("LLaVA-Video is not installed. Please install LLaVA-Video to use this model.") + +try: + from llavavid.model.language_model.llava_qwen import LlavaQwenConfig + + AutoConfig.register("llava_qwen", LlavaQwenConfig) +except: + eval_logger.debug("") + + +@register_model("llavavid") +class LlavaVid(lmms): + """ + LlavaVid Model + """ + + def __init__( + self, + pretrained: str = "liuhaotian/llava-v1.5-7b", + truncation: Optional[bool] = True, + device: Optional[str] = "cuda:0", + batch_size: Optional[Union[int, str]] = 1, + attn_implementation=( + "sdpa" if torch.__version__ >= "2.1.2" else "eager" + ), # inference implementation for attention, can be "sdpa", "eager", "flash_attention_2". 
Seems FA2 is not effective during inference: https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453/5 + device_map="cuda:0", + conv_template="vicuna_v1", + use_cache=True, + truncate_context=False, # whether to truncate the context in generation, set it False for LLaVA-1.6 + max_frames_num: int = 3, + mm_resampler_type: str = "spatial_pool", + mm_spatial_pool_stride: int = 2, + mm_spatial_pool_out_channels: int = 1024, + mm_spatial_pool_mode: str = "average", + overwrite: bool = True, + video_decode_backend: str = "pyav", + **kwargs, + ) -> None: + super().__init__() + assert kwargs == {}, f"Unexpected kwargs: {kwargs}" + + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + + self.pretrained = pretrained + self.model_name = get_model_name_from_path(pretrained) + self.video_decode_backend = video_decode_backend + # self._config = AutoConfig.from_pretrained(self.pretrained) + self.overwrite = overwrite + self.mm_resampler_type = mm_resampler_type + self.mm_spatial_pool_stride = int(mm_spatial_pool_stride) + self.mm_spatial_pool_out_channels = int(mm_spatial_pool_out_channels) + self.mm_spatial_pool_mode = mm_spatial_pool_mode + self.max_frames_num = int(max_frames_num) + if self.overwrite == True: + overwrite_config = {} + overwrite_config["mm_resampler_type"] = self.mm_resampler_type + overwrite_config["mm_spatial_pool_stride"] = self.mm_spatial_pool_stride + overwrite_config["mm_spatial_pool_out_channels"] = self.mm_spatial_pool_out_channels + overwrite_config["mm_spatial_pool_mode"] = self.mm_spatial_pool_mode + overwrite_config["mm_resampler_location"] = "before" + overwrite_config["patchify_video_feature"] = False + overwrite_config["attn_implementation"] = attn_implementation + + cfg_pretrained = AutoConfig.from_pretrained(self.pretrained) + + if cfg_pretrained.architectures[0] == "LlavaLlamaForCausalLM": # Ugly code, only used in vicuna that needs ROPE + if "224" in cfg_pretrained.mm_vision_tower: + least_token_number = self.max_frames_num * (16 // self.mm_spatial_pool_stride) ** 2 + 1000 + else: + least_token_number = self.max_frames_num * (24 // self.mm_spatial_pool_stride) ** 2 + 1000 + + scaling_factor = math.ceil(least_token_number / 4096) + if scaling_factor >= 2: + overwrite_config["rope_scaling"] = {"factor": float(scaling_factor), "type": "linear"} + overwrite_config["max_sequence_length"] = 4096 * scaling_factor + overwrite_config["tokenizer_model_max_length"] = 4096 * scaling_factor + + if "v1.5" in pretrained: # A hardcode solution here to load v1.5 model, otherwise it will use LlavaConfig from hf transformers + from transformers import AutoTokenizer + from llavavid.model.language_model.llava_llama import LlavaConfig, LlavaLlamaForCausalLM + + self._tokenizer = AutoTokenizer.from_pretrained(pretrained, use_fast=False) + cfg_pretrained = LlavaConfig.from_pretrained(pretrained) + if overwrite_config is not None: + print(f"Overwriting config with {overwrite_config}") + for k, v in overwrite_config.items(): + 
setattr(cfg_pretrained, k, v) + kwargs["torch_dtype"] = torch.float16 + self._model = LlavaLlamaForCausalLM.from_pretrained(pretrained, low_cpu_mem_usage=True, config=cfg_pretrained, device_map=self.device_map, **kwargs) + vision_tower = self._model.get_vision_tower() + if not vision_tower.is_loaded: + vision_tower.load_model(device_map=self.device_map) + if self.device_map != "auto": + vision_tower.to(device="cuda", dtype=torch.float16) + self._image_processor = vision_tower.image_processor + + if hasattr(self._model.config, "max_sequence_length"): + self._max_length = self._model.config.max_sequence_length + else: + self._max_length = 2048 + else: + self._tokenizer, self._model, self._image_processor, self._max_length = load_pretrained_model(pretrained, None, self.model_name, device_map=self.device_map, overwrite_config=overwrite_config) + else: + self._tokenizer, self._model, self._image_processor, self._max_length = load_pretrained_model( + pretrained, + None, + self.model_name, + device_map=self.device_map, + ) + + self._config = self._model.config + self.model.eval() + self.model.tie_weights() + self.truncation = truncation + self.batch_size_per_gpu = int(batch_size) + self.conv_template = conv_template + self.use_cache = use_cache + self.truncate_context = truncate_context + # assert self.batch_size_per_gpu == 1, "Llava currently does not support batched generation. See https://github.com/haotian-liu/LLaVA/issues/754. HF Llava also has this issue." + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. + if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. Make sure you run `accelerate config` and set zero stage to 0") + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + elif accelerator.num_processes == 1 and device_map == "auto": + eval_logger.info(f"Using {accelerator.num_processes} devices with tensor parallelism") + self._rank = 0 + self._word_size = 1 + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + @property + def config(self): + # return the associated transformers.AutoConfig for the given pretrained model. 
+ return self._config + + @property + def tokenizer(self): + return self._tokenizer + + @property + def model(self): + # returns the model, unwrapping it if using Accelerate + if hasattr(self, "accelerator"): + return self.accelerator.unwrap_model(self._model) + else: + return self._model + + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + + @property + def max_length(self): + return self._max_length + + def pad_sequence(self, input_ids, batch_first, padding_value): + if self.tokenizer.padding_side == "left": + input_ids = [torch.flip(_input_ids, [0]) for _input_ids in input_ids] + input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=batch_first, padding_value=padding_value) + if self.tokenizer.padding_side == "left": + input_ids = torch.flip(input_ids, [1]) + return input_ids + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size + + def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None) -> List[int]: + """ """ + add_special_tokens = False if add_special_tokens is None else add_special_tokens + encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens) + # left-truncate the encoded context to be at most `left_truncate_len` tokens long + if left_truncate_len: + encoding = encoding[-left_truncate_len:] + return encoding + + def load_video(self, video_path, max_frames_num): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + # fps = round(vr.get_avg_fps()) + # frame_idx = [i for i in range(0, len(vr), fps)] + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + spare_frames = vr.get_batch(frame_idx).asnumpy() + return spare_frames # (frames, height, width, channels) + + def tok_decode(self, tokens): + return self.tokenizer.decode(tokens) + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, doc_to_target, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + if type(doc_to_target) == str: + continuation = doc_to_target + else: + continuation = doc_to_target(self.task_dict[task][split][doc_id]) + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + videos = [] + for visual in visuals: + video = self.load_video(visual, self.max_frames_num) + video = self._image_processor.preprocess(video, return_tensors="pt")["pixel_values"].half().cuda() + videos.append(video) + + qs = contexts + if self.model.config.mm_use_im_start_end: + qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + qs + else: + qs = DEFAULT_IMAGE_TOKEN + "\n" + qs + + conv = conv_templates[self.conv_template].copy() + conv.append_message(conv.roles[0], qs) + conv.append_message(conv.roles[1], None) + prompt = conv.get_prompt() + + contxt_id = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(self.device) + + conv = conv_templates[self.conv_template].copy() + conv.append_message(conv.roles[0], qs) + 
conv.append_message(conv.roles[1], continuation) + prompt = conv.get_prompt() + + input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda() + attention_masks = input_ids.ne(self.tokenizer.pad_token_id).long().cuda() + + labels = input_ids.clone() + # Context part no need to calculate for loss + labels[0, : contxt_id.shape[1]] = -100 + + with torch.inference_mode(): + outputs = self.model(input_ids=input_ids, labels=labels, images=videos, modalities="video") + + loss = outputs["loss"] + # loss = torch.exp(loss) + logits = outputs["logits"] + greedy_tokens = logits.argmax(dim=-1) + cont_toks = input_ids[:, contxt_id.shape[1] :] # [1, seq] + greedy_tokens = greedy_tokens[:, contxt_id.shape[1] : input_ids.shape[1]] # [1, seq] + max_equal = (greedy_tokens == cont_toks).all() + res.append((float(loss.item()), bool(max_equal))) + pbar.update(1) + pbar.close() + return res + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + videos = [] + try: + for visual in visuals: + if self.video_decode_backend == "decord": + video = self.load_video(visual, self.max_frames_num) + elif self.video_decode_backend == "pyav": + video = read_video_pyav(visual, num_frm=self.max_frames_num) + # video = self.load_video(visual, self.max_frames_num) + video = self._image_processor.preprocess(video, return_tensors="pt")["pixel_values"].half().cuda() + videos.append(video) + except Exception as e: + eval_logger.info(f"{e}") + eval_logger.info(f"Video {visuals} can not load, check the source") + video_path = "\n".join(visuals) + res.append(f"Video {video_path} can not load, check the source") + pbar.update(1) + continue + + qs = contexts + if self.model.config.mm_use_im_start_end: + qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + qs + else: + qs = DEFAULT_IMAGE_TOKEN + "\n" + qs + + # This is much safer for llama3, as we now have some object type in it + if "llama_3" in self.conv_template: + conv = copy.deepcopy(conv_templates[self.conv_template]) + else: + conv = conv_templates[self.conv_template].copy() + + conv.append_message(conv.roles[0], qs) + conv.append_message(conv.roles[1], None) + prompt = conv.get_prompt() + + input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda() + pad_token_ids = self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else self.tokenizer.eos_token_id + if "llama_3" in self.conv_template: + pad_token_ids = 0 # lmms-lab/llama3-llava-8b is trained on this pad token id. You may need to customize this for other models. 
+ attention_masks = input_ids.ne(pad_token_ids).long().cuda() + + # input_ids_list = [tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt") for prompt in question_input] + # pad_token_ids = self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else self.tokenizer.eos_token_id + # input_ids = self.pad_sequence(input_ids_list, batch_first=True, padding_value=pad_token_ids).to(self.device) + # attention_masks = input_ids.ne(pad_token_ids).to(self.device) + + stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 + keywords = [stop_str] + stopping_criteria = KeywordsStoppingCriteria(keywords, self.tokenizer, input_ids) + + cur_prompt = contexts + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0.2 + if "top_p" not in gen_kwargs: + gen_kwargs["top_p"] = None + if "num_beams" not in gen_kwargs: + gen_kwargs["num_beams"] = 1 + with torch.inference_mode(): + output_ids = self.model.generate( + inputs=input_ids, + images=videos, + attention_mask=attention_masks, + modalities="video", + use_cache=self.use_cache, + stopping_criteria=[stopping_criteria], + do_sample=True if gen_kwargs["temperature"] > 0 else False, + temperature=gen_kwargs["temperature"], + top_p=gen_kwargs["top_p"], + num_beams=gen_kwargs["num_beams"], + max_new_tokens=gen_kwargs["max_new_tokens"], + ) + # output_ids = model.generate(inputs=input_ids, images=video, attention_mask=attention_masks, modalities="video", do_sample=True, temperature=0.2, use_cache=True, stopping_criteria=[stopping_criteria]) + + outputs = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() + res.append(outputs) + pbar.update(1) + return res diff --git a/lmms_eval/models/minicpm_v.py b/lmms_eval/models/minicpm_v.py old mode 100644 new mode 100755 diff --git a/lmms_eval/models/model_utils/__init__.py b/lmms_eval/models/model_utils/__init__.py old mode 100644 new mode 100755 diff --git a/lmms_eval/models/model_utils/load_video.py b/lmms_eval/models/model_utils/load_video.py new file mode 100644 index 00000000..789039e7 --- /dev/null +++ b/lmms_eval/models/model_utils/load_video.py @@ -0,0 +1,55 @@ +import av +from av.codec.context import CodecContext +import numpy as np + + +# This one is faster +def record_video_length_stream(container, indices): + frames = [] + start_index = indices[0] + end_index = indices[-1] + for i, frame in enumerate(container.decode(video=0)): + if i > end_index: + break + if i >= start_index and i in indices: + frames.append(frame) + return frames + + +# This one works for all types of video +def record_video_length_packet(container): + frames = [] + # https://github.com/PyAV-Org/PyAV/issues/1269 + # https://www.cnblogs.com/beyond-tester/p/17641872.html + # context = CodecContext.create("libvpx-vp9", "r") + for packet in container.demux(video=0): + for frame in packet.decode(): + frames.append(frame) + return frames + + +def read_video_pyav(video_path, num_frm=8): + + if "webm" not in video_path and "mkv" not in video_path: + # For mp4, we try loading with stream first + try: + container = av.open(video_path) + total_frames = container.streams.video[0].frames + sampled_frm = min(total_frames, num_frm) + indices = np.linspace(0, total_frames - 1, sampled_frm, dtype=int) + frames = record_video_length_stream(container, indices) + except: + container = av.open(video_path) + frames = record_video_length_packet(container) + total_frames = 
len(frames) + sampled_frm = min(total_frames, num_frm) + indices = np.linspace(0, total_frames - 1, sampled_frm, dtype=int) + frames = [frames[i] for i in indices] + else: + container = av.open(video_path) + frames = record_video_length_packet(container) + total_frames = len(frames) + sampled_frm = min(total_frames, num_frm) + indices = np.linspace(0, total_frames - 1, sampled_frm, dtype=int) + frames = [frames[i] for i in indices] + return np.stack([x.to_ndarray(format="rgb24") for x in frames]) diff --git a/lmms_eval/models/model_utils/qwen/qwen_generate_utils.py b/lmms_eval/models/model_utils/qwen/qwen_generate_utils.py old mode 100644 new mode 100755 diff --git a/lmms_eval/models/mplug_owl_video.py b/lmms_eval/models/mplug_owl_video.py new file mode 100644 index 00000000..bfc52d23 --- /dev/null +++ b/lmms_eval/models/mplug_owl_video.py @@ -0,0 +1,194 @@ +import logging +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from typing import List, Optional, Union, Tuple +import torch +from transformers import AutoTokenizer +from tqdm import tqdm +from datetime import timedelta + +from lmms_eval import utils +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval.utils import stop_sequences_criteria + +from lmms_eval.models.mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration +from lmms_eval.models.mplug_owl_video.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor + + +eval_logger = logging.getLogger("lmms-eval") + + +@register_model("mplug_owl_video") +class mplug_Owl(lmms): + def __init__( + self, + pretrained: str = "MAGAer13/mplug-owl-llama-7b-video", + device: Optional[str] = "cuda:0", + dtype: Optional[Union[str, torch.dtype]] = "auto", + batch_size: Optional[Union[int, str]] = 1, + device_map="cuda:0", + num_frames: Union[str, int] = 4, + **kwargs, + ) -> None: + """ + Install instructions: + 1. Install lmms-eval + cd lmms-eval + pip install -e .; + 2. Install other packages with restricted versions + pip install av sentencepiece protobuf==3.20 transformers==4.28.1 einops; + """ + super().__init__() + + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + + # import pdb; pdb.set_trace() + # This is very slow. 
Their issue, not mine + # Also, keep transformers in version 4.28.1 + # They put a Config object inside a config object, this is not acceptable + # for transformers == 4.39.1, object type not serializable + # Protobuf needs to be in 3.20.x otherwise error + # ヽ(`Д´)ノ + self._model = MplugOwlForConditionalGeneration.from_pretrained( + pretrained, + torch_dtype=torch.bfloat16, + ) + self.image_processor = MplugOwlImageProcessor.from_pretrained(pretrained) + self._tokenizer = AutoTokenizer.from_pretrained(pretrained) + self.processor = MplugOwlProcessor(self.image_processor, self.tokenizer) + self.model.eval() + self.batch_size_per_gpu = batch_size + self.num_frames = num_frames + + self.model.to(self.device) + + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. + if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. Make sure you run `accelerate config` and set zero stage to 0") + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + @property + def config(self): + # return the associated transformers.AutoConfig for the given pretrained model. 
+ return self._config + + @property + def tokenizer(self): + return self._tokenizer + + @property + def model(self): + # returns the model, unwrapping it if using Accelerate + if hasattr(self, "accelerator"): + return self.accelerator.unwrap_model(self._model) + else: + return self._model + + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + + @property + def max_length(self): + return self._max_length + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def format_prompt(self, question): + prompts = [f" <|video|> Question : {question} Answer : "] + return prompts + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + inputs = self.processor(text=self.format_prompt(contexts), videos=visuals, num_frames=self.num_frames, return_tensors="pt") + pixel_values_videos = inputs["video_pixel_values"] + if pixel_values_videos.shape[2] != self.num_frames: + empty_frames = torch.zeros((1, pixel_values_videos.shape[1], self.num_frames - pixel_values_videos.shape[2], *pixel_values_videos.shape[3:]), dtype=pixel_values_videos.dtype) + pixel_values_videos = torch.cat([pixel_values_videos, empty_frames], dim=2) + inputs["video_pixel_values"] = pixel_values_videos + inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()} + inputs = {k: v.to(self.model.device) for k, v in inputs.items()} + + if "max_new_tokens" in gen_kwargs: + gen_kwargs["max_length"] = gen_kwargs["max_new_tokens"] + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_length"] = 128 + if "do_sample" not in gen_kwargs: + gen_kwargs["do_sample"] = False + if "top_k" not in gen_kwargs: + gen_kwargs["top_k"] = 1 + + generate_kwargs = {"do_sample": gen_kwargs["do_sample"], "top_k": gen_kwargs["top_k"], "max_length": gen_kwargs["max_length"]} + + with torch.no_grad(): + outputs = self.model.generate(**inputs, **generate_kwargs) + sentence = self.tokenizer.decode(outputs.tolist()[0], skip_special_tokens=True) + pbar.update(1) + res.append(sentence) + pbar.close() + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + return super().loglikelihood(requests) diff --git a/lmms_eval/models/mplug_owl_video/__init__.py b/lmms_eval/models/mplug_owl_video/__init__.py new file mode 100644 index 00000000..2020ad3a --- /dev/null +++ b/lmms_eval/models/mplug_owl_video/__init__.py @@ -0,0 +1,77 @@ +# Copyright 2020 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import TYPE_CHECKING + +from transformers.utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available + + +_import_structure = { + "configuration_mplug_owl": ["MPLUG_OWL_PRETRAINED_CONFIG_ARCHIVE_MAP", "MplugOwlConfig"], + "processing_mplug_owl": ["MplugOwlImageProcessor", "MplugOwlProcessor"], + "tokenization_mplug_owl": ["MplugOwlTokenizer"], +} + +try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass + + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_mplug_owl"] = [ + "MPLUG_OWL_PRETRAINED_MODEL_ARCHIVE_LIST", + "MplugOwlForConditionalGeneration", + "MplugOwlModel", + ] + + +if TYPE_CHECKING: + from .configuration_mplug_owl import MPLUG_OWL_PRETRAINED_CONFIG_ARCHIVE_MAP, MplugOwlConfig + from .tokenization_mplug_owl import MplugOwlTokenizer + + try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_mplug_owl import ( + MPLUG_OWL_PRETRAINED_MODEL_ARCHIVE_LIST, + MplugOwlForConditionalGeneration, + MplugOwlModel, + MplugOwlPreTrainedModel, + ) + + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) + +from .configuration_mplug_owl import * +from .modeling_mplug_owl import * +from .processing_mplug_owl import * +from .tokenization_mplug_owl import * diff --git a/lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py b/lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py new file mode 100644 index 00000000..6b5d458d --- /dev/null +++ b/lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py @@ -0,0 +1,289 @@ +# coding=utf-8 +# Copyright 2022 x-plug and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" MplugOwl model configuration """ +import copy +import os +from typing import Union + +from transformers.configuration_utils import PretrainedConfig +from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES +from transformers.utils import logging +from transformers.models.auto import CONFIG_MAPPING + + +logger = logging.get_logger(__name__) + +MPLUG_OWL_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "MAGAer13/mplug-owl-llama-7b": "https://huggingface.co/MAGAer13/mplug-owl-llama-7b/resolve/main/config.json", + # See all MplugOwl models at https://huggingface.co/models?filter=mplug_owl +} + + +class MplugOwlVisionConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`MplugOwlVisionModel`]. It is used to instantiate a + mPLUG-Owl vision encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration defaults will yield a similar configuration to that of the mPLUG-Owl + [x-plug/x_plug-llama-7b](https://huggingface.co/x-plug/x_plug-llama-7b) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int`, *optional*, defaults to 32): + The size (resolution) of each patch. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-5): + The epsilon used by the layer normalization layers. + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). 
+ + + ```""" + + model_type = "mplug_owl_vision_model" + + def __init__( + self, + hidden_size=1024, + intermediate_size=4096, + projection_dim=768, + num_hidden_layers=24, + num_attention_heads=16, + num_channels=3, + image_size=224, + patch_size=14, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + attention_dropout=0.0, + initializer_range=0.02, + initializer_factor=1.0, + use_flash_attn=False, + **kwargs, + ): + super().__init__(**kwargs) + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_channels = num_channels + self.patch_size = patch_size + self.image_size = image_size + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.attention_dropout = attention_dropout + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.use_flash_attn = use_flash_attn + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the vision config dict if we are loading from MplugOwlConfig + if config_dict.get("model_type") == "mplug-owl": + config_dict = config_dict["vision_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning(f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " f"{cls.model_type}. This is not supported for all configurations of models and can yield errors.") + + return cls.from_dict(config_dict, **kwargs) + + +class MplugOwlVisualAbstractorConfig(PretrainedConfig): + model_type = "mplug_owl_visual_abstract" + + def __init__( + self, + hidden_size=1024, # + num_hidden_layers=6, # + num_attention_heads=16, # + intermediate_size=4096, # + attention_probs_dropout_prob=0.1, # + initializer_range=0.02, + layer_norm_eps=1e-6, # + encoder_hidden_size=1024, # + **kwargs, + ): + super().__init__(**kwargs) + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.encoder_hidden_size = encoder_hidden_size + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the visual_abstractor config dict if we are loading from MplugOwlConfig + if config_dict.get("model_type") == "mplug-owl": + config_dict = config_dict["abstractor_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning(f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " f"{cls.model_type}. This is not supported for all configurations of models and can yield errors.") + + return cls.from_dict(config_dict, **kwargs) + + +class MplugOwlConfig(PretrainedConfig): + r""" + [`MplugOwlConfig`] is the configuration class to store the configuration of a [`MplugOwlForConditionalGeneration`]. 
It is + used to instantiate a mPLUG-Owl model according to the specified arguments, defining the vision model, Q-Former model + and language model configs. Instantiating a configuration with the defaults will yield a similar configuration to + that of the mPLUG-Owl [x-plug/x_plug-llama-7b](https://huggingface.co/x-plug/x_plug-llama-7b) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + vision_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`MplugOwlVisionConfig`]. + visual_abstractor_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`MplugOwlVisualAbstractorConfig`]. + text_config (`dict`, *optional*): + Dictionary of configuration options used to initialize any [`PretrainedConfig`]. + num_query_tokens (`int`, *optional*, defaults to 32): + The number of query tokens passed through the Transformer. + + kwargs (*optional*): + Dictionary of keyword arguments. + + Example: + + ```python + >>> from transformers import ( + ... MplugOwlVisionConfig, + ... MplugOwlVisualAbstractorConfig, + ... OPTConfig, + ... MplugOwlConfig, + ... MplugOwlForConditionalGeneration, + ... ) + + >>> # Initializing a MplugOwlConfig with x-plug/x_plug-llama-7b style configuration + >>> configuration = MplugOwlConfig() + + >>> # Initializing a MplugOwlForConditionalGeneration (with random weights) from the x-plug/x_plug-llama-7b style configuration + >>> model = MplugOwlForConditionalGeneration(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + + >>> # We can also initialize a MplugOwlConfig from a MplugOwlVisionConfig, MplugOwlVisualAbstractorConfig and any PretrainedConfig + + >>> # Initializing mPLUG-Owl vision, mPLUG-Owl Q-Former and language model configurations + >>> vision_config = MplugOwlVisionConfig() + >>> visual_abstractor_config = MplugOwlVisualAbstractorConfig() + >>> text_config = OPTConfig() + + >>> config = MplugOwlConfig.from_text_vision_configs(vision_config, visual_abstractor_config, text_config) + ```""" + + model_type = "mplug-owl" + is_composition = True + + def __init__(self, vision_config=None, visual_abstractor_config=None, text_config=None, num_query_tokens=64, **kwargs): + super().__init__(**kwargs) + if vision_config is None: + vision_config = MplugOwlVisionConfig().to_dict() + logger.info("vision_config is None.") + + if visual_abstractor_config is None: + visual_abstractor_config = {} + logger.info("abstractor_config is None. 
") + + if text_config is None: + # we use LLAMA 7b by default + from ..llama.configuration_llama import LlamaConfig + + text_config = LlamaConfig(pad_token_id=2).to_dict() + logger.info("text_config is None.") + + self.vision_config = MplugOwlVisionConfig(**vision_config) + self.visual_abstractor_config = MplugOwlVisualAbstractorConfig(**visual_abstractor_config) + # self.visual_abstractor_config.layer_norm_eps = 1e-6 + text_model_type = text_config["model_type"] if "model_type" in text_config else "llama" + self.text_config = CONFIG_MAPPING[text_model_type](**text_config) + + self.tie_word_embeddings = self.text_config.tie_word_embeddings + self.is_encoder_decoder = self.text_config.is_encoder_decoder + + self.num_query_tokens = num_query_tokens + # self.visual_abstractor_config.encoder_hidden_size = self.vision_config.hidden_size + self.use_decoder_only_language_model = self.text_config.model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES + self.initializer_factor = 1.0 + self.initializer_range = 0.02 + + for attr in dir(self.text_config): + if not hasattr(self, attr): + setattr(self, attr, getattr(self.text_config, attr)) + + @classmethod + def from_vision_visual_abstractor_text_configs( + cls, + vision_config: MplugOwlVisionConfig, + visual_abstractor_config: MplugOwlVisualAbstractorConfig, + text_config: PretrainedConfig, + **kwargs, + ): + r""" + Instantiate a [`MplugOwlConfig`] (or a derived class) from a mPLUG-Owl vision model, Q-Former and language model + configurations. + + Returns: + [`MplugOwlConfig`]: An instance of a configuration object + """ + + return cls( + vision_config=vision_config.to_dict(), + visual_abstractor_config=visual_abstractor_config.to_dict(), + text_config=text_config.to_dict(), + **kwargs, + ) + + def to_dict(self): + """ + Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. + + Returns: + `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, + """ + output = copy.deepcopy(self.__dict__) + output["vision_config"] = self.vision_config.to_dict() + output["visual_abstractor_config"] = self.visual_abstractor_config.to_dict() + output["text_config"] = self.text_config.to_dict() + output["model_type"] = self.__class__.model_type + return output diff --git a/lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py b/lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py new file mode 100644 index 00000000..6c5b7592 --- /dev/null +++ b/lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py @@ -0,0 +1,1841 @@ +# coding=utf-8 +# Copyright 2022 x-plug The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch MplugOwl model. """ + +import logging +import math +from typing import Any, Optional, Tuple, Union + +try: + from flash_attn.flash_attn_interface import flash_attn_unpadded_func + + flash_attn_func = flash_attn_unpadded_func +except: + flash_attn_func = None + print("Error importing flash_attn in mplug_owl. 
Please install flash-attn first.") +import math +from dataclasses import dataclass +from typing import Any, Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn +import einops + +from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, BaseModelOutputWithPastAndCrossAttentions +from transformers.modeling_utils import PreTrainedModel +from transformers.pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer +from transformers.utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from transformers.models.auto import AutoModelForCausalLM +from .configuration_mplug_owl import MplugOwlConfig, MplugOwlVisionConfig, MplugOwlVisualAbstractorConfig + + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "MAGAer13/mplug-owl-llama-7b" +_CONFIG_FOR_DOC = "MplugOwlConfig" + + +MPLUG_OWL_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "MAGAer13/mplug-owl-llama-7b", + # See all MplugOwl models at https://huggingface.co/models?filter=mplug_owl +] + + +@dataclass +class MplugOwlForConditionalGenerationModelOutput(ModelOutput): + """ + Class defining the outputs of [`MPlugOwlForConditionalGeneration`]. + + Args: + loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): + Language modeling loss from the language model. + logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): + Prediction scores of the language modeling head of the language model. + vision_outputs (`BaseModelOutputWithPooling`): + Outputs of the vision encoder. + + language_model_outputs (`CausalLMOutputWithPast` or `Seq2SeqLMOutput`): + Outputs of the language model. + """ + + loss: Optional[Tuple[torch.FloatTensor]] = None + logits: Optional[Tuple[torch.FloatTensor]] = None + vision_outputs: Optional[torch.FloatTensor] = None + language_model_outputs: Optional[Tuple[torch.FloatTensor]] = None + + def to_tuple(self) -> Tuple[Any]: + return tuple(self[k] if k not in ["vision_outputs", "language_model_outputs"] else getattr(self, k).to_tuple() for k in self.keys()) + + +def get_ltor_masks_and_position_ids_from_embeddings(data): + """Build masks and position id for left to right model.""" + + # Extract batch size and sequence length. + micro_batch_size, seq_length = data.size()[:2] + + # Attention mask (lower triangular). + att_mask_batch = 1 + attention_mask = torch.tril(torch.ones((att_mask_batch, seq_length, seq_length), device=data.device)).view(att_mask_batch, 1, seq_length, seq_length) + + # Loss mask. + loss_mask = torch.ones(data.size()[:2], dtype=torch.float, device=data.device) + + # Position ids. 
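+    # (Editor's note, illustrative, not in the upstream mPLUG-Owl file: with micro_batch_size=2 and
+    # seq_length=4 the ids built below come out as tensor([[0, 1, 2, 3], [0, 1, 2, 3]]), and the `< 0.5`
+    # comparison further down turns the lower-triangular float mask into a boolean mask that is
+    # True only on future, i.e. masked, positions.)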
+ position_ids = torch.arange(seq_length, dtype=torch.long, device=data.device) + position_ids = position_ids.unsqueeze(0).expand_as(data[..., 0]) + + # Convert attention mask to binary: + attention_mask = attention_mask < 0.5 + + return attention_mask, loss_mask, position_ids + + +class MplugOwlVisionEmbeddings(nn.Module): + def __init__(self, config: MplugOwlVisionConfig): + super().__init__() + self.config = config + self.hidden_size = config.hidden_size + self.image_size = config.image_size + self.patch_size = config.patch_size + + self.cls_token = nn.Parameter(torch.randn(1, 1, self.hidden_size)) + + self.patch_embed = nn.Conv2d( + in_channels=3, + out_channels=self.hidden_size, + kernel_size=self.patch_size, + stride=self.patch_size, + bias=False, + ) + + self.num_patches = (self.image_size // self.patch_size) ** 2 + + self.position_embedding = nn.Parameter(torch.randn(1, self.num_patches + 1, self.hidden_size)) + + self.pre_layernorm = LayerNormFp32(self.hidden_size, eps=config.layer_norm_eps) + + def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor: + # [B, C, T, H, W] or [B, C, H, W] + batch_size = pixel_values.size(0) + T = pixel_values.size(2) if pixel_values.dim() > 4 else 1 + if T > 1: + pixel_values = einops.rearrange(pixel_values, "b c t h w -> (b t) c h w") + image_embeds = self.patch_embed(pixel_values) + image_embeds = image_embeds.flatten(2).transpose(1, 2) + + class_embeds = self.cls_token.expand(batch_size * T, 1, -1).to(image_embeds.dtype) + embeddings = torch.cat([class_embeds, image_embeds], dim=1) + embeddings = embeddings + self.position_embedding[:, : embeddings.size(1)].to(image_embeds.dtype) + embeddings = self.pre_layernorm(embeddings) + embeddings = einops.rearrange(embeddings, "(b t) n d -> b t n d", b=batch_size) + return embeddings + + +class LayerNormFp32(nn.LayerNorm): + """Subclass torch's LayerNorm to handle fp16 (by casting to float32 and back).""" + + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def forward(self, x: torch.Tensor): + output = torch.nn.functional.layer_norm( + x.float(), + self.normalized_shape, + self.weight.float() if self.weight is not None else None, + self.bias.float() if self.bias is not None else None, + self.eps, + ) + return output.type_as(x) + + +class QuickGELU(nn.Module): + def forward(self, x: torch.Tensor): + return x * torch.sigmoid(1.702 * x) + + +class MplugOwlVisionLocalTemporal(nn.Module): + def __init__(self, config): + super(MplugOwlVisionLocalTemporal, self).__init__() + + self.image_size = config.image_size + self.patch_size = config.patch_size + self.num_patches = 1 + (self.image_size // self.patch_size) ** 2 + self.hidden_size = config.hidden_size + d_bottleneck = self.hidden_size // 2 + + self.ln = LayerNormFp32(self.hidden_size) + self.down_proj = nn.Conv3d(self.hidden_size, d_bottleneck, kernel_size=1, stride=1, padding=0) + self.conv = nn.Conv3d(d_bottleneck, d_bottleneck, kernel_size=(3, 1, 1), stride=1, padding=(1, 0, 0), groups=d_bottleneck) + self.up_proj = nn.Conv3d(d_bottleneck, self.hidden_size, kernel_size=1, stride=1, padding=0) + + nn.init.constant_(self.up_proj.weight, 0) + nn.init.constant_(self.up_proj.bias, 0) + + self.activation_func = QuickGELU() + + def forward(self, x): + # [b, t, s, c] + T = x.size(1) + H = int((self.num_patches - 1) ** 0.5) + cls_token, x = x[:, :, 0:1], x[:, :, 1:] + x = self.ln(x) + x = einops.rearrange(x, "b t (h w) c -> b c t h w", h=H) + x = self.down_proj(x) + if self.conv.weight.dtype == torch.bfloat16: + x = 
torch.nn.functional.conv3d(x.half(), self.conv.weight.half(), bias=self.conv.bias.half(), stride=1, padding=(1, 0, 0), groups=self.conv.weight.shape[0]).to(cls_token.dtype) + else: + x = self.conv(x) + x = self.activation_func(x) + x = self.up_proj(x) + x = einops.rearrange(x, "b c t h w -> b t (h w) c") + x = torch.cat([cls_token, x], dim=2) + return x + + +class MplugOwlVisionAttention(nn.Module): + """Multi-headed attention from 'Attention Is All You Need' paper""" + + def __init__(self, config): + super().__init__() + self.config = config + self.hidden_size = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.hidden_size // self.num_heads + if self.head_dim * self.num_heads != self.hidden_size: + raise ValueError(f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size} and `num_heads`:" f" {self.num_heads}).") + self.scale = self.head_dim**-0.5 + self.dropout = nn.Dropout(config.attention_dropout) + + self.query_key_value = nn.Linear(self.hidden_size, 3 * self.hidden_size) + self.dense = nn.Linear(self.hidden_size, self.hidden_size) + + def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): + return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() + + def forward( + self, + hidden_states: torch.Tensor, + head_mask: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + """Input shape: Batch x Time x Channel""" + + bsz, seq_len, embed_dim = hidden_states.size() + + mixed_qkv = self.query_key_value(hidden_states) + + mixed_qkv = mixed_qkv.reshape(bsz, seq_len, self.num_heads, 3, embed_dim // self.num_heads).permute(3, 0, 2, 1, 4) # [3, b, np, sq, hn] + query_states, key_states, value_states = ( + mixed_qkv[0], + mixed_qkv[1], + mixed_qkv[2], + ) + # if self.config.use_flash_attn and flash_attn_func is not None: + if False: + # [b*sq, np, hn] + query_states = query_states.permute(0, 2, 1, 3).contiguous() + query_states = query_states.view(query_states.size(0) * query_states.size(1), query_states.size(2), -1) + + key_states = key_states.permute(0, 2, 1, 3).contiguous() + key_states = key_states.view(key_states.size(0) * key_states.size(1), key_states.size(2), -1) + + value_states = value_states.permute(0, 2, 1, 3).contiguous() + value_states = value_states.view(value_states.size(0) * value_states.size(1), value_states.size(2), -1) + + cu_seqlens = torch.arange(0, (bsz + 1) * seq_len, step=seq_len, dtype=torch.int32, device=query_states.device) + + context_layer = flash_attn_func( + query_states, + key_states, + value_states, + cu_seqlens, + cu_seqlens, + seq_len, + seq_len, + self.dropout if self.training else 0.0, + softmax_scale=self.scale, + causal=False, + return_attn_probs=False, + ) + # [b*sq, np, hn] => [b, sq, np, hn] + context_layer = context_layer.view(bsz, seq_len, context_layer.size(1), context_layer.size(2)) + else: + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2)) + + attention_scores = attention_scores * self.scale + + # Normalize the attention scores to probabilities. + attention_probs = torch.softmax(attention_scores, dim=-1) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. 
+ attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_states).permute(0, 2, 1, 3) + + new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size,) + context_layer = context_layer.reshape(new_context_layer_shape) + + output = self.dense(context_layer) + + outputs = (output, attention_probs) if output_attentions else (output, None) + + return outputs + + +class MplugOwlMLP(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.activation_fn = QuickGELU() + self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size) + self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size) + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.fc1(hidden_states) + hidden_states = self.activation_fn(hidden_states) + hidden_states = self.fc2(hidden_states) + return hidden_states + + +class MplugOwlVisionEncoderLayer(nn.Module): + def __init__(self, config: MplugOwlVisionConfig): + super().__init__() + self.hidden_size = config.hidden_size + self.temporal = MplugOwlVisionLocalTemporal(config) + self.self_attn = MplugOwlVisionAttention(config) + self.input_layernorm = LayerNormFp32(self.hidden_size, eps=config.layer_norm_eps) + self.mlp = MplugOwlMLP(config) + self.post_attention_layernorm = LayerNormFp32(self.hidden_size, eps=config.layer_norm_eps) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: torch.Tensor, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.FloatTensor]: + """ + Args: + hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, time, seq_len, embed_dim)` + attention_mask (`torch.FloatTensor`): attention mask of size + `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + `(config.encoder_attention_heads,)`. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + """ + B, T = hidden_states.size(0), hidden_states.size(1) + if T > 1: + hidden_states = hidden_states + self.temporal(hidden_states) + hidden_states = einops.rearrange(hidden_states, "b t n d -> (b t) n d") + + residual = hidden_states + + hidden_states = self.input_layernorm(hidden_states) + hidden_states, attn_weights = self.self_attn( + hidden_states=hidden_states, + head_mask=attention_mask, + output_attentions=output_attentions, + ) + hidden_states = hidden_states + residual + residual = hidden_states + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = self.mlp(hidden_states) + + hidden_states = hidden_states + residual + hidden_states = einops.rearrange(hidden_states, "(b t) n d -> b t n d", b=B) + + outputs = (hidden_states,) + + if output_attentions: + outputs += (attn_weights,) + + return outputs + + +class MplugOwlPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
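+    (Editor's aside, not part of the upstream docstring: the `MplugOwlVisionLocalTemporal` module defined
+    above zero-initializes its output projection `up_proj`, so the newly added temporal convolution
+    contributes nothing to the patch tokens at the start of training. A minimal sketch of that zero-init
+    residual-adapter pattern, using hypothetical plain linear layers rather than the module's 3D convolutions:)
+
+    ```python
+    import torch
+    from torch import nn
+
+    class ZeroInitAdapter(nn.Module):
+        """Bottleneck branch whose last projection starts at zero (cf. MplugOwlVisionLocalTemporal.up_proj)."""
+
+        def __init__(self, dim: int):
+            super().__init__()
+            self.down = nn.Linear(dim, dim // 2)
+            self.up = nn.Linear(dim // 2, dim)
+            nn.init.zeros_(self.up.weight)
+            nn.init.zeros_(self.up.bias)
+
+        def forward(self, x):
+            return self.up(torch.nn.functional.gelu(self.down(x)))
+
+    x = torch.randn(2, 5, 32)
+    adapter = ZeroInitAdapter(32)
+    assert torch.allclose(x + adapter(x), x)  # the adapter branch starts as a no-op
+    ```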
+ """ + + config_class = MplugOwlConfig + base_model_prefix = "mplug_owl" + supports_gradient_checkpointing = True + _keys_to_ignore_on_load_missing = [ + r"position_ids", + r"language_model.encoder.embed_tokens.weight", + r"language_model.decoder.embed_tokens.weight", + r"language_model.lm_head.weight", + ] + _no_split_modules = [ + "MplugOwlVisionEncoderLayer", + "LlamaDecoderLayer", + "MplugOwlVisualAbstractorLayer", + "LlamaForCausalLM", + "Parameter", + ] + _keep_in_fp32_modules = ["wo"] + + def _init_weights(self, module): + """Initialize the weights""" + factor = self.config.initializer_range + if isinstance(module, nn.Conv2d) or isinstance(module, nn.Embedding) or isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=factor) + if hasattr(module, "bias") and module.bias is not None: + module.bias.data.zero_() + + if isinstance(module, MplugOwlVisionEmbeddings): + if hasattr(self.config, "vision_config"): + factor = self.config.vision_config.initializer_range + nn.init.trunc_normal_(module.position_embedding, mean=0.0, std=factor) + nn.init.trunc_normal_(module.cls_token, mean=0.0, std=factor) + + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + elif isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Parameter): + raise ValueError + nn.init.trunc_normal_(module.data, mean=0.0, std=factor) + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, MplugOwlVisionEncoder): + module.gradient_checkpointing = value + + +MPLUG_OWL_START_DOCSTRING = r""" + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`MplugOwlConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +MPLUG_OWL_VISION_INPUTS_DOCSTRING = r""" + Args: + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Pixel values can be obtained using [`MplugOwlProcessor`]. See [`MplugOwlProcessor.__call__`] for + details. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +MPLUG_OWL_TEXT_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. 
[What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + [What are attention masks?](../glossary#attention-mask) + decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Indices of decoder input sequence tokens in the vocabulary. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are decoder input IDs?](../glossary#decoder-input-ids) + + T5 uses the `pad_token_id` as the starting token for `decoder_input_ids` generation. If `past_key_values` + is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`). + + To know more on how to prepare `decoder_input_ids` for pretraining take a look at [T5 + Training](./t5#training). + decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also + be used by default. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +MPLUG_OWL_INPUTS_DOCSTRING = r""" + Args: + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Pixel values can be obtained using [`MplugOwlProcessor`]. See [`MplugOwlProcessor.__call__`] for + details. + + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of input sequence tokens in the vocabulary of the language model. Input tokens can optionally be + provided to serve as text prompt, which the language model can continue. + + Indices can be obtained using [`MplugOwlProcessor`]. See [`MplugOwlProcessor.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + + decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Indices of decoder input sequence tokens in the vocabulary of the language model. Only relevant in case an + encoder-decoder language model (like T5) is used. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. [What are decoder input IDs?](../glossary#decoder-input-ids) + + decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also + be used by default. 
+ + Only relevant in case an encoder-decoder language model (like T5) is used. + + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +class MplugOwlVisionEncoder(nn.Module): + """ + Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a + [`MplugOwlVisionEncoderLayer`]. + + Args: + config (`MplugOwlVisionConfig`): + The corresponding vision configuration for the `MplugOwlEncoder`. + """ + + def __init__(self, config: MplugOwlVisionConfig): + super().__init__() + self.config = config + self.layers = nn.ModuleList([MplugOwlVisionEncoderLayer(config) for _ in range(config.num_hidden_layers)]) + self.gradient_checkpointing = False + + def forward( + self, + inputs_embeds, + attention_mask: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutput]: + r""" + Args: + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Embedded representation of the inputs. Should be float, not int tokens. + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
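+        Example (editor's illustration, not part of the upstream docstring; it assumes this repository's
+        `lmms_eval.models.mplug_owl_video` package plus `torch`, `einops` and `transformers` are importable):
+
+        ```python
+        >>> import torch
+        >>> from lmms_eval.models.mplug_owl_video.configuration_mplug_owl import MplugOwlVisionConfig
+        >>> from lmms_eval.models.mplug_owl_video.modeling_mplug_owl import MplugOwlVisionEncoder
+
+        >>> # Tiny config: 28px frames with 14px patches -> (28 // 14) ** 2 = 4 patch tokens + 1 CLS token.
+        >>> cfg = MplugOwlVisionConfig(hidden_size=32, intermediate_size=64, num_hidden_layers=2,
+        ...                            num_attention_heads=4, image_size=28, patch_size=14)
+        >>> encoder = MplugOwlVisionEncoder(cfg)
+
+        >>> # (batch, time, tokens, hidden) -- two video frames of five tokens each.
+        >>> inputs_embeds = torch.randn(1, 2, 5, cfg.hidden_size)
+        >>> out = encoder(inputs_embeds=inputs_embeds, return_dict=True)
+        >>> out.last_hidden_state.shape
+        torch.Size([1, 2, 5, 32])
+        ```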
+ """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + encoder_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + + hidden_states = inputs_embeds + for idx, encoder_layer in enumerate(self.layers): + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(encoder_layer), + hidden_states, + attention_mask, + ) + else: + layer_outputs = encoder_layer( + hidden_states, + attention_mask, + output_attentions=output_attentions, + ) + + hidden_states = layer_outputs[0] + + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None) + return BaseModelOutput(last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions) + + +class MplugOwlVisionModel(MplugOwlPreTrainedModel): + main_input_name = "pixel_values" + config_class = MplugOwlVisionConfig + + def __init__(self, config: MplugOwlVisionConfig): + super().__init__(config) + self.config = config + self.hidden_size = config.hidden_size + + self.embeddings = MplugOwlVisionEmbeddings(config) + self.encoder = MplugOwlVisionEncoder(config) + self.post_layernorm = LayerNormFp32(self.hidden_size, eps=config.layer_norm_eps) + + self.post_init() + + @add_start_docstrings_to_model_forward(MPLUG_OWL_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=MplugOwlVisionConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPooling]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + hidden_states = self.embeddings(pixel_values) # [B, T, N, D] + + encoder_outputs = self.encoder( + inputs_embeds=hidden_states, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_state = encoder_outputs[0] + last_hidden_state = self.post_layernorm(last_hidden_state) + + pooled_output = last_hidden_state[:, :, 0, :].mean(1) + pooled_output = self.post_layernorm(pooled_output) + + if not return_dict: + return (last_hidden_state, pooled_output) + encoder_outputs[1:] + + return BaseModelOutputWithPooling( + last_hidden_state=last_hidden_state, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + 
attentions=encoder_outputs.attentions, + ) + + def get_input_embeddings(self): + return self.embeddings + + +class MplugOwlVisualAbstractorMLP(nn.Module): + def __init__(self, config: MplugOwlVisualAbstractorConfig): + super().__init__() + self.config = config + in_features = config.hidden_size + hidden_features = config.intermediate_size + if hidden_features != 2816: + hidden_features = int(2 * hidden_features / 3) + multiple_of = 256 + hidden_features = multiple_of * ((hidden_features + multiple_of - 1) // multiple_of) + self.act = nn.SiLU() + + self.w1 = nn.Linear(in_features, hidden_features) + self.w2 = nn.Linear(hidden_features, in_features) + self.w3 = nn.Linear(in_features, hidden_features) + self.ffn_ln = LayerNormFp32(hidden_features, eps=config.layer_norm_eps) + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.act(self.w1(hidden_states)) * self.w3(hidden_states) + hidden_states = self.ffn_ln(hidden_states) + hidden_states = self.w2(hidden_states) + return hidden_states + + +class MplugOwlVisualAbstractorMultiHeadAttention(nn.Module): + def __init__(self, config: MplugOwlVisualAbstractorConfig): + super().__init__() + self.config = config + if config.hidden_size % config.num_attention_heads != 0: + raise ValueError("The hidden size (%d) is not a multiple of the number of attention heads (%d)" % (config.hidden_size, config.num_attention_heads)) + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.encoder_hidden_size, self.all_head_size) + self.value = nn.Linear(config.encoder_hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + self.save_attention = False + + def save_attn_gradients(self, attn_gradients): + self.attn_gradients = attn_gradients + + def get_attn_gradients(self): + return self.attn_gradients + + def save_attention_map(self, attention_map): + self.attention_map = attention_map + + def get_attention_map(self): + return self.attention_map + + def transpose_for_scores(self, x): + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(*new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_value=None, + output_attentions=False, + ): + # If this is instantiated as a cross-attention module, the keys + # and values come from an encoder; the attention mask needs to be + # such that the encoder's padding tokens are not attended to. + key_layer = self.transpose_for_scores(self.key(encoder_hidden_states)) + value_layer = self.transpose_for_scores(self.value(encoder_hidden_states)) + attention_mask = encoder_attention_mask + + mixed_query_layer = self.query(hidden_states) + + query_layer = self.transpose_for_scores(mixed_query_layer) + + past_key_value = (key_layer, value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. 
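+        # (Editor's note, not in the upstream file: `query_layer` holds the learned query tokens with shape
+        # (batch, num_heads, query_len, head_dim), while `key_layer`/`value_layer` are projected from
+        # `encoder_hidden_states`, so the scores computed below have shape (batch, num_heads, query_len, key_len).)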
+ attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + + if attention_mask is not None: + # Apply the attention mask is (precomputed for all layers in BertModel forward() function) + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = nn.Softmax(dim=-1)(attention_scores) + + if self.save_attention: + self.save_attention_map(attention_probs) + attention_probs.register_hook(self.save_attn_gradients) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs_dropped = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs_dropped = attention_probs_dropped * head_mask + + context_layer = torch.matmul(attention_probs_dropped, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + outputs = outputs + (past_key_value,) + return outputs + + +class MplugOwlVisualAbstractorCrossOutput(nn.Module): + def __init__(self, config: MplugOwlVisualAbstractorConfig): + super().__init__() + dim = config.hidden_size + self.out_proj = nn.Linear(dim, dim, bias=True) + self.norm2 = LayerNormFp32(dim) + self.mlp = MplugOwlVisualAbstractorMLP(config) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + input_tensor = input_tensor + self.out_proj(hidden_states) + input_tensor = input_tensor + self.mlp(self.norm2(input_tensor)) + return input_tensor + + +class MplugOwlVisualAbstractorAttention(nn.Module): + def __init__(self, config: MplugOwlVisualAbstractorConfig): + super().__init__() + self.attention = MplugOwlVisualAbstractorMultiHeadAttention(config) + self.output = MplugOwlVisualAbstractorCrossOutput(config) + self.pruned_heads = set() + self.norm1 = LayerNormFp32(config.hidden_size) + self.normk = LayerNormFp32(config.hidden_size) + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices(heads, self.attention.num_attention_heads, self.attention.attention_head_size, self.pruned_heads) + + # Prune linear layers + self.attention.query = prune_linear_layer(self.attention.query, index) + self.attention.key = prune_linear_layer(self.attention.key, index) + self.attention.value = prune_linear_layer(self.attention.value, index) + self.output.dense = prune_linear_layer(self.output.out_proj, index, dim=1) + + # Update hyper params and store pruned heads + self.attention.num_attention_heads = self.attention.num_attention_heads - len(heads) + self.attention.all_head_size = self.attention.attention_head_size * self.attention.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + # HACK we apply norm on 
q and k + hidden_states = self.norm1(hidden_states) + encoder_hidden_states = self.normk(encoder_hidden_states) + encoder_hidden_states = torch.cat([hidden_states, encoder_hidden_states], dim=1) + encoder_attention_mask = torch.cat([attention_mask, encoder_attention_mask], dim=-1) + self_outputs = self.attention( + hidden_states, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + # add attentions if we output them + outputs = (attention_output,) + self_outputs[1:] + return outputs + + +class MplugOwlVisualAbstractorLayer(nn.Module): + def __init__(self, config, layer_idx): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + + self.layer_idx = layer_idx + + self.crossattention = MplugOwlVisualAbstractorAttention(config) + self.has_cross_attention = True + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=False, + ): + if encoder_hidden_states is None: + raise ValueError("encoder_hidden_states must be given for cross-attention layers") + cross_attention_outputs = self.crossattention( + hidden_states, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions=output_attentions, + ) + query_attention_output = cross_attention_outputs[0] + + outputs = (query_attention_output,) + return outputs + + +class MplugOwlVisualAbstractorEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layers = nn.ModuleList([MplugOwlVisualAbstractorLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]) + self.gradient_checkpointing = False + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_values=None, + output_attentions=False, + output_hidden_states=False, + return_dict=True, + ): + all_hidden_states = () if output_hidden_states else None + + for i in range(self.config.num_hidden_layers): + layer_module = self.layers[i] + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + past_key_value = past_key_values[i] if past_key_values is not None else None + + if getattr(self.config, "gradient_checkpointing", False) and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, past_key_value, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(layer_module), + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions, + ) + + hidden_states = layer_outputs[0] + + return BaseModelOutput( + last_hidden_state=hidden_states, + ) + + +class MplugOwlVisualAbstractorModel(MplugOwlPreTrainedModel): + def __init__(self, config: MplugOwlVisualAbstractorConfig, language_hidden_size): + super().__init__(config) + self.config = config + + self.encoder = MplugOwlVisualAbstractorEncoder(config) + self.visual_fc = torch.nn.Linear(config.hidden_size, language_hidden_size) + 
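+        # (Editor's note, not in the upstream file: `visual_fc` and the `temporal_visual_fc` defined on the
+        # next line project the abstractor's hidden states into the language model's embedding space;
+        # `forward` later appends the learned `vit_eos` vector, so each image/video contributes
+        # num_query_tokens + 1 language-space embeddings.)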
self.temporal_visual_fc = torch.nn.Linear(config.hidden_size, language_hidden_size) + self.vit_eos = torch.nn.Parameter(torch.randn(1, 1, language_hidden_size)) + nn.init.trunc_normal_(self.vit_eos, mean=0.0, std=self.config.initializer_range) + self.post_init() + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + def get_extended_attention_mask( + self, + attention_mask: torch.Tensor, + input_shape: Tuple[int], + device: torch.device, + ) -> torch.Tensor: + """ + Makes broadcastable attention and causal masks so that future and masked tokens are ignored. + + Arguments: + attention_mask (`torch.Tensor`): + Mask with ones indicating tokens to attend to, zeros for tokens to ignore. + input_shape (`Tuple[int]`): + The shape of the input to the model. + device: (`torch.device`): + The device of the input to the model. + + Returns: + `torch.Tensor` The extended attention mask, with a the same dtype as `attention_mask.dtype`. + """ + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + if attention_mask.dim() == 3: + extended_attention_mask = attention_mask[:, None, :, :] + elif attention_mask.dim() == 2: + # Provided a padding mask of dimensions [batch_size, seq_length] + # - the model is an encoder, so make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length] + extended_attention_mask = attention_mask[:, None, None, :] + else: + raise ValueError("Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(input_shape, attention_mask.shape)) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and -10000.0 for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + extended_attention_mask = extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + return extended_attention_mask + + def forward( + self, + query_embeds, + temporal_query_embeds=None, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_values=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, `optional`): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if + the model is configured as a decoder. + encoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, `optional`): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. 
+ past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of: + shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): Contains precomputed key and + value hidden states of the attention blocks. Can be used to speed up decoding. If `past_key_values` are + used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key + value states given to this model) of shape `(batch_size, 1)` instead of all `decoder_input_ids` of shape + `(batch_size, sequence_length)`. + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + T = encoder_hidden_states.size(1) + if T == 1 or temporal_query_embeds is None: + embedding_output = query_embeds + else: + embedding_output = torch.cat([query_embeds, temporal_query_embeds], dim=1) + input_shape = embedding_output.size()[:-1] + batch_size, seq_length = input_shape + device = embedding_output.device + + encoder_hidden_states = einops.rearrange(encoder_hidden_states, "b t n d -> b (t n) d") + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + if attention_mask is None: + attention_mask = torch.ones((embedding_output.shape[0], embedding_output.shape[1]), dtype=torch.long, device=embedding_output.device) + extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, device) + + # If a 2D or 3D attention mask is provided for the cross-attention + # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] + if encoder_hidden_states is not None: + if type(encoder_hidden_states) == list: + encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states[0].size() + else: + ( + encoder_batch_size, + encoder_sequence_length, + _, + ) = encoder_hidden_states.size() + encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length) + + if type(encoder_attention_mask) == list: + encoder_extended_attention_mask = [self.invert_attention_mask(mask) for mask in encoder_attention_mask] + elif encoder_attention_mask is None: + encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device) + encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask) + else: + encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask) + else: + encoder_extended_attention_mask = None + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) + + encoder_outputs = self.encoder( + embedding_output, + attention_mask=extended_attention_mask, + head_mask=head_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_extended_attention_mask, + past_key_values=past_key_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = 
encoder_outputs[0] + pooled_output = sequence_output[:, 0, :] + + if T == 1 or temporal_query_embeds is None: + temporal_sequence_output = None + else: + temporal_sequence_output = sequence_output[:, query_embeds.size(1) :] + sequence_output = sequence_output[:, : query_embeds.size(1)] + + sequence_output = self.visual_fc(sequence_output) + if temporal_sequence_output is not None: + sequence_output += self.temporal_visual_fc(temporal_sequence_output) + sequence_output = torch.cat([sequence_output, self.vit_eos.repeat(sequence_output.shape[0], 1, 1)], dim=1) + + return BaseModelOutputWithPooling( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + ) + + +@add_start_docstrings( + """ + mPLUG-Owl Model for generating text and image features. The model consists of a vision encoder, Querying Transformer + (Q-Former) and a language model. + """, + MPLUG_OWL_START_DOCSTRING, +) +class MplugOwlModel(MplugOwlPreTrainedModel): + config_class = MplugOwlConfig + main_input_name = "pixel_values" + + def __init__(self, config: MplugOwlConfig, *inputs, **kwargs): + super().__init__(config, *inputs, **kwargs) + + self.vision_model = MplugOwlVisionModel(config.vision_config) + + self.query_tokens = nn.Parameter(torch.zeros(1, config.num_query_tokens, config.visual_abstractor_config.hidden_size)) + self.temporal_query_tokens = nn.Parameter(torch.zeros(1, config.num_query_tokens, config.visual_abstractor_config.hidden_size)) + self.abstractor = MplugOwlVisualAbstractorModel(config.visual_abstractor_config, config.text_config.hidden_size) + + # if config.use_decoder_only_language_model: + # from llama.modeling_llama import LlamaForCausalLM + language_model = AutoModelForCausalLM.from_config(config.text_config) + # else: + # language_model = AutoModelForSeq2SeqLM.from_config(config.text_config) + self.language_model = language_model + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.language_model.get_input_embeddings() + + def set_input_embeddings(self, value): + self.language_model.set_input_embeddings(value) + + def set_output_embeddings(self, new_embeddings): + self.language_model.set_output_embeddings(new_embeddings) + + def get_output_embeddings(self) -> nn.Module: + return self.language_model.get_output_embeddings() + + def get_encoder(self): + return self.language_model.get_encoder() + + def get_decoder(self): + return self.language_model.get_decoder() + + def _tie_weights(self): + if not self.config.use_decoder_only_language_model: + self.language_model.encoder.embed_tokens = self.language_model.shared + self.language_model.decoder.embed_tokens = self.language_model.shared + + def get_text_features( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + decoder_input_ids: Optional[torch.Tensor] = None, + decoder_attention_mask: Optional[torch.Tensor] = None, + labels: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ): + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if self.config.use_decoder_only_language_model: + text_outputs = 
self.language_model( + input_ids=input_ids, + attention_mask=attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + else: + inputs_embeds = self.language_model.get_input_embeddings()(input_ids) + + text_outputs = self.language_model( + inputs_embeds=inputs_embeds, + attention_mask=attention_mask, + decoder_input_ids=decoder_input_ids, + decoder_attention_mask=decoder_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + labels=labels, + ) + + return text_outputs + + def get_image_features( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ): + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + return vision_outputs + + +def get_media_indices(my_list): + if isinstance(my_list, torch.Tensor): + my_list = my_list.cpu().tolist() + result = [] + for i in range(len(my_list)): + if i == 0 and my_list[i] < 0: + result.append(i) + elif my_list[i] != my_list[i - 1] and my_list[i] < 0: + result.append(i) + return result + + +def get_media_types(tensors, positions): + if isinstance(tensors, torch.Tensor): + tensors = tensors.cpu().tolist() + result = [] + for pos in positions: + result.append(tensors[pos]) + return result + + +@add_start_docstrings( + """ + mPLUG-Owl Model for generating text given an image and an optional text prompt. 
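+    (Editor's illustration of the `get_media_indices` / `get_media_types` helpers defined above; not part of
+    the upstream docstring. It assumes the processor marks image and video placeholders with runs of distinct
+    negative token ids — the values -1 and -2 below are chosen purely for illustration.)
+
+    ```python
+    >>> from lmms_eval.models.mplug_owl_video.modeling_mplug_owl import get_media_indices, get_media_types
+
+    >>> input_ids = [1, -1, -1, -1, 319, 13563, -2, -2, -2, 2]
+    >>> positions = get_media_indices(input_ids)   # start index of each placeholder run
+    >>> positions
+    [1, 6]
+    >>> get_media_types(input_ids, positions)      # the marker value identifies the media type
+    [-1, -2]
+    ```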
+ """, + MPLUG_OWL_START_DOCSTRING, +) +class MplugOwlForConditionalGeneration(MplugOwlPreTrainedModel): + config_class = MplugOwlConfig + main_input_name = "pixel_values" + + def __init__(self, config: MplugOwlConfig): + super().__init__(config) + + self.vision_model = MplugOwlVisionModel(config.vision_config) + + self.query_tokens = nn.Parameter(torch.zeros(1, config.num_query_tokens, config.visual_abstractor_config.hidden_size)) + self.temporal_query_tokens = nn.Parameter(torch.zeros(1, config.num_query_tokens, config.visual_abstractor_config.hidden_size)) + self.abstractor = MplugOwlVisualAbstractorModel(config.visual_abstractor_config, config.text_config.hidden_size) + + # if config.use_decoder_only_language_model: + # from llama.modeling_llama import LlamaForCausalLM + language_model = AutoModelForCausalLM.from_config(config.text_config) + # else: + # language_model = AutoModelForSeq2SeqLM.from_config(config.text_config) + self.language_model = language_model + + # Initialize weights and apply final processing + self.post_init() + self.main_input_name = "input_ids" + from transformers import GenerationConfig + + self.generation_config = GenerationConfig(max_length=512, do_sample=True, top_k=3, pad_token_id=0, unk_token_id=0, bos_token_id=1, eos_token_id=2) + + # Hack Bloom + if config.text_config.model_type == "bloom": + bound_method = bloom_forward.__get__(self.language_model.transformer, self.language_model.transformer.__class__) + setattr(self.language_model.transformer, "forward", bound_method) + + def get_input_embeddings(self): + return self.language_model.get_input_embeddings() + + def set_input_embeddings(self, value): + self.language_model.set_input_embeddings(value) + + def set_output_embeddings(self, new_embeddings): + self.language_model.set_output_embeddings(new_embeddings) + + def get_output_embeddings(self) -> nn.Module: + return self.language_model.get_output_embeddings() + + def get_encoder(self): + return self.language_model.get_encoder() + + def get_decoder(self): + return self.language_model.get_decoder() + + def _tie_weights(self): + if not self.config.use_decoder_only_language_model: + self.language_model.encoder.embed_tokens = self.language_model.shared + self.language_model.decoder.embed_tokens = self.language_model.shared + + def _preprocess_accelerate(self): + r""" + Some pre-processing hacks to make the model `accelerate` compatible. Check + https://github.com/huggingface/transformers/pull/21707 for more details. + """ + hf_device_map = self.hf_device_map + + if len(hf_device_map) > 1 and "language_model" not in hf_device_map and torch.cuda.device_count() > 1: + # warn users about unexpected behavior when using multi-GPU + mPLUG-Owl + `accelerate`. + logger.warning( + "The `language_model` is not in the `hf_device_map` dictionary and you are running your script" + " in a multi-GPU environment. this may lead to unexpected behavior when using `accelerate`." + " Please pass a `device_map` that contains `language_model` to remove this warning." 
+ " Please refer to https://github.com/huggingface/blog/blob/main/accelerate-large-models.md for" + " more details on creating a `device_map` for large models.", + ) + + if hasattr(self.language_model, "_hf_hook"): + self.language_model._hf_hook.io_same_device = True # For `generate` compatibility + + @add_start_docstrings_to_model_forward(MPLUG_OWL_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=MplugOwlForConditionalGenerationModelOutput, config_class=MplugOwlVisionConfig) + def forward( + self, + pixel_values: torch.FloatTensor, + video_pixel_values: torch.FloatTensor, + input_ids: torch.FloatTensor, + num_images, + num_videos, + non_padding_mask: Optional[torch.LongTensor] = None, + non_media_mask: Optional[torch.LongTensor] = None, + prompt_mask: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + decoder_input_ids: Optional[torch.LongTensor] = None, + decoder_attention_mask: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + labels: Optional[torch.LongTensor] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, MplugOwlForConditionalGenerationModelOutput]: + r""" + Returns: + + Examples: + + Image captioning (without providing a text prompt): + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import MplugOwlProcessor, MplugOwlForConditionalGeneration + >>> import torch + + >>> device = "cuda" if torch.cuda.is_available() else "cpu" + + >>> processor = MplugOwlProcessor.from_pretrained("x-plug/x_plug-llama-7b") + >>> model = MplugOwlForConditionalGeneration.from_pretrained( + ... "x-plug/x_plug-llama-7b", torch_dtype=torch.float16 + ... ) + >>> model.to(device) # doctest: +IGNORE_RESULT + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt").to(device, torch.float16) + + >>> generated_ids = model.generate(**inputs) + >>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() + >>> print(generated_text) + two cats laying on a couch + ``` + + Visual question answering (prompt = question): + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import MplugOwlProcessor, MplugOwlForConditionalGeneration + >>> import torch + + >>> device = "cuda" if torch.cuda.is_available() else "cpu" + + >>> processor = MplugOwlProcessor.from_pretrained("x-plug/x_plug-llama-7b") + >>> model = MplugOwlForConditionalGeneration.from_pretrained( + ... "x-plug/x_plug-llama-7b", torch_dtype=torch.float16 + ... ) + >>> model.to(device) # doctest: +IGNORE_RESULT + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> prompt = "Question: how many cats are there? 
Answer:" + >>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16) + + >>> generated_ids = model.generate(**inputs) + >>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() + >>> print(generated_text) + two + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # get text embedding + text_tokens_ = input_ids.clone() + batch_size = input_ids.shape[0] + # labels = text_tokens_[:, 1:].clone().contiguous() + + media_token_indices = [ + # [:-1] since we would not use the last token for embedding + get_media_indices(text_tokens_[i][:-1]) + for i in range(batch_size) + ] + + media_token_types = [get_media_types(text_tokens_[i][:-1], media_token_indices[i]) for i in range(batch_size)] + + text_tokens_[text_tokens_ < 0] = 1 # Not used + # text_tokens = text_tokens_[:, :-1].contiguous() + text_embeds = self.get_input_embeddings()(text_tokens_) # Temporally Embedding + + if pixel_values is not None: + image_embeds = self.vision_model(pixel_values, return_dict=True).last_hidden_state + + image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device) + query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1) + temporal_query_tokens = self.temporal_query_tokens.expand(image_embeds.shape[0], -1, -1) + + query_features = self.abstractor( + query_embeds=query_tokens, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_attention_mask, + )["last_hidden_state"] + img_seq_length = query_features.shape[1] + + if video_pixel_values is not None: + video_embeds = self.vision_model(video_pixel_values, return_dict=True).last_hidden_state + + video_attention_mask = torch.ones(video_embeds.size()[:-1], dtype=torch.long, device=video_embeds.device) + video_attention_mask = einops.rearrange(video_attention_mask, "b t n -> b (t n)") + query_tokens = self.query_tokens.expand(video_embeds.shape[0], -1, -1) + temporal_query_tokens = self.temporal_query_tokens.expand(video_embeds.shape[0], -1, -1) + + video_query_features = self.abstractor( + query_embeds=query_tokens, + temporal_query_embeds=temporal_query_tokens, + encoder_hidden_states=video_embeds, + encoder_attention_mask=video_attention_mask, + )["last_hidden_state"] + vid_seq_length = video_query_features.shape[1] + + num_images_per_sample = num_images.long().cpu().tolist() + num_videos_per_sample = num_videos.long().cpu().tolist() + + text_chunk_embeds = [] + img_idx = 0 + for b in range(batch_size): + start = 0 + result = [] + if len(media_token_indices[b]) > 0: + for i, pos in enumerate(media_token_indices[b]): + if pos > start: + result.append(text_embeds[b, start:pos]) + result.append(query_features[img_idx + i]) + start = pos + img_seq_length + if start < text_embeds.shape[1]: + result.append(text_embeds[b, start:]) + + img_idx += num_images_per_sample[b] + text_chunk_embeds.append(torch.cat(result, dim=0)) + + # Actual Input Embeddings + input_embeds = torch.stack(text_chunk_embeds, dim=0) + + # if pixel_values is None and self.language_model.is_gradient_checkpointing: + # # Hack here when gradient checkpoint is enable. 
+ # # Keep the compute graph static + # image_embeds = self.vision_model(torch.zeros(1,3,224,224,device=input_embeds.device,dtype=input_embeds.dtype), return_dict=True).last_hidden_state + # query_tokens = self.query_tokens.expand( + # image_embeds.shape[0], -1, -1) + # query_features = self.abstractor(query_embeds=query_tokens, + # encoder_hidden_states=image_embeds,)['last_hidden_state'] + + # input_embeds = input_embeds + query_features.mean()*0 + + # Create causal mask and position ids + _, loss_mask, position_ids = get_ltor_masks_and_position_ids_from_embeddings(input_embeds) + + # Calculate the loss_mask + non_padding_mask = non_padding_mask.long() + non_media_mask = non_media_mask.long() + prompt_mask = prompt_mask.long() # TODO How to deal with prompt mask + # from icecream import ic + # non_padding_mask = non_padding_mask[:,:-1] + # non_media_mask = non_media_mask[:,:-1] + # prompt_mask = prompt_mask[:,:-1] + # attention_mask = attention_mask[:,:-1] + loss_mask = loss_mask[:, :-1] + + loss_mask = loss_mask * non_padding_mask * non_media_mask * prompt_mask + labels[:, 1:][loss_mask != 1] = -100 + # Forward into GPT + outputs = self.language_model( + inputs_embeds=input_embeds, + attention_mask=attention_mask, + labels=labels, + return_dict=return_dict, + output_attentions=self.config.output_attentions, + ) + # outputs.loss = (outputs.loss * loss_mask.view(-1) + # ).sum()/loss_mask.sum() + return outputs + + @torch.no_grad() + def generate( + self, + pixel_values: torch.FloatTensor = None, + video_pixel_values: torch.FloatTensor = None, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + isdecoder=True, + **generate_kwargs, + ) -> torch.LongTensor: + """ + Overrides `generate` function to be able to use the model as a conditional generator. + + Args: + pixel_values (`torch.FloatTensor` of shape (batch_size, num_channels, height, width)): + Input images to be processed. + input_ids (`torch.LongTensor` of shape (batch_size, sequence_length), *optional*): + The sequence used as a prompt for the generation. + attention_mask (`torch.LongTensor` of shape (batch_size, sequence_length), *optional*): + Mask to avoid performing attention on padding token indices + + Returns: + captions (list): A list of strings of length batch_size * num_captions. 
+ """ + if input_ids is None: + return self.language_model.generate(attention_mask=attention_mask, **generate_kwargs) + + if attention_mask is None: + attention_mask = input_ids.new_ones(*input_ids.shape) + + batch_size = input_ids.size(0) + media_token_indices = [get_media_indices(input_ids[i]) for i in range(batch_size)] + media_token_types = [get_media_types(input_ids[i], media_token_indices[i]) for i in range(batch_size)] + num_images_per_sample = [len([y for y in x if y == -1]) for x in media_token_types] + num_videos_per_sample = [len([y for y in x if y < -1]) for x in media_token_types] + input_ids = input_ids.clone() # prevent inplace modify + input_ids[input_ids < 0] = 0 # Not used + + if hasattr(self, "hf_device_map"): + # preprocess for `accelerate` + self._preprocess_accelerate() + batch_size = input_ids.shape[0] + # get text embedding + inputs_embeds = self.get_input_embeddings()(input_ids) + if hasattr(self.language_model, "transformer") and hasattr(self.language_model.transformer, "word_embeddings_layernorm"): + inputs_embeds = self.language_model.transformer.word_embeddings_layernorm(inputs_embeds) + # get visual embedding + if pixel_values is not None: + pixel_values = pixel_values.to(input_ids.device) + with torch.no_grad(): + image_embeds = self.vision_model(pixel_values, return_dict=True).last_hidden_state + image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device) + query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1) + query_outputs = self.abstractor( + query_embeds=query_tokens, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_attention_mask, + return_dict=True, + ) + query_output = query_outputs["last_hidden_state"] + image_embeds = query_output + img_seq_length = image_embeds.shape[1] + + if video_pixel_values is not None: + video_pixel_values = video_pixel_values.to(input_ids.device) + with torch.no_grad(): + video_embeds = self.vision_model(video_pixel_values, return_dict=True).last_hidden_state + video_attention_mask = torch.ones(video_embeds.size()[:-1], dtype=torch.long, device=video_embeds.device) + video_attention_mask = einops.rearrange(video_attention_mask, "b t n -> b (t n)") + query_tokens = self.query_tokens.expand(video_embeds.shape[0], -1, -1) + temporal_query_tokens = self.temporal_query_tokens.expand(video_embeds.shape[0], -1, -1) + query_outputs = self.abstractor( + query_embeds=query_tokens, + temporal_query_embeds=temporal_query_tokens, + encoder_hidden_states=video_embeds, + encoder_attention_mask=video_attention_mask, + return_dict=True, + ) + query_output = query_outputs["last_hidden_state"] + video_embeds = query_output + vid_seq_length = video_embeds.shape[1] + + # =================== + # Get actual input embeddings + # =================== + text_chunk_embeds = [] + text_chunk_attns = [] + img_idx = 0 + vid_idx = 0 + + for b in range(batch_size): + start = 0 + result = [] + result_attn = [] + for i, pos in enumerate(media_token_indices[b]): + curr_image_idx, curr_video_idx = 0, 0 + if pos > start: + result.append(inputs_embeds[b, start:pos]) + result_attn.append(attention_mask[b, start:pos]) + if media_token_types[b][i] == -1: + result.append(image_embeds[img_idx + curr_image_idx]) + result_attn.append(torch.ones(image_embeds[img_idx + curr_image_idx].shape[0], device=inputs_embeds.device)) + start = pos + img_seq_length + curr_image_idx += 1 + else: + result.append(video_embeds[vid_idx + curr_video_idx]) + result_attn.append(torch.ones(video_embeds[img_idx 
+ curr_video_idx].shape[0], device=inputs_embeds.device)) + start = pos + vid_seq_length + curr_video_idx += 1 + if start < inputs_embeds.shape[1]: + result.append(inputs_embeds[b, start:]) + result_attn.append(attention_mask[b, start:]) + + img_idx += num_images_per_sample[b] + vid_idx += num_videos_per_sample[b] + text_chunk_embeds.append(torch.cat(result, dim=0)) + text_chunk_attns.append(torch.cat(result_attn, dim=0)) + inputs_embeds = torch.stack(text_chunk_embeds, dim=0) + attention_mask = torch.stack(text_chunk_attns, dim=0) + + outputs = self.language_model.generate( + inputs_embeds=inputs_embeds, + # input_ids=input_ids, + attention_mask=attention_mask, + **generate_kwargs, + ) + + return outputs + + def prepare_inputs_for_generation(self, input_ids, pixel_values=None, video_pixel_values=None, past_key_values=None, attention_mask=None, **model_kwargs): + input_shape = input_ids.shape + # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly + if attention_mask is None: + attention_mask = input_ids.new_ones(input_shape) + + # # cut decoder_input_ids if past_key_values is used + # if past_key_values is not None: + # input_ids = input_ids[:, -1:] + + return { + "input_ids": input_ids, + "pixel_values": pixel_values, + "video_pixel_values": video_pixel_values, + "attention_mask": attention_mask, + # "past_key_values": past_key_values, + # "encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None), + # "encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None), + "is_decoder": True, + } + + +def bloom_forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + **deprecated_arguments, +) -> Union[Tuple[torch.Tensor, ...], BaseModelOutputWithPastAndCrossAttentions]: + if deprecated_arguments.pop("position_ids", False) is not False: + # `position_ids` could have been `torch.Tensor` or `None` so defaulting pop to `False` allows to detect if users were passing explicitly `None` + warnings.warn( + "`position_ids` have no functionality in BLOOM and will be removed in v5.0.0. 
You can safely ignore" " passing `position_ids`.", + FutureWarning, + ) + if len(deprecated_arguments) > 0: + raise ValueError(f"Got unexpected arguments: {deprecated_arguments}") + + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + batch_size, seq_length = input_ids.shape + elif inputs_embeds is not None: + batch_size, seq_length, _ = inputs_embeds.shape + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + if past_key_values is None: + past_key_values = tuple([None] * len(self.h)) + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape batch_size x num_heads x N x N + # head_mask has shape n_layer x batch x num_heads x N x N + head_mask = self.get_head_mask(head_mask, self.config.n_layer) + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + inputs_embeds = self.word_embeddings_layernorm(inputs_embeds) + + hidden_states = inputs_embeds + + presents = () if use_cache else None + all_self_attentions = () if output_attentions else None + all_hidden_states = () if output_hidden_states else None + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once("`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...") + use_cache = False + + # Compute alibi tensor: check build_alibi_tensor documentation + seq_length_with_past = seq_length + past_key_values_length = 0 + if past_key_values[0] is not None: + past_key_values_length = past_key_values[0][0].shape[2] + seq_length_with_past = seq_length_with_past + past_key_values_length + if attention_mask is None: + attention_mask = torch.ones((batch_size, seq_length_with_past), device=hidden_states.device) + else: + attention_mask = attention_mask.to(hidden_states.device) + + alibi = self.build_alibi_tensor(attention_mask, self.num_heads, dtype=hidden_states.dtype) + + causal_mask = self._prepare_attn_mask( + attention_mask, + input_shape=(batch_size, seq_length), + past_key_values_length=past_key_values_length, + ) + + for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + # None for past_key_value + return module(*inputs, use_cache=use_cache, output_attentions=output_attentions) + + return custom_forward + + outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(block), + hidden_states, + alibi, + causal_mask, + layer_past, + head_mask[i], + ) + else: + outputs = block( + hidden_states, + layer_past=layer_past, + attention_mask=causal_mask, + head_mask=head_mask[i], + use_cache=use_cache, + output_attentions=output_attentions, + alibi=alibi, + ) + + hidden_states = outputs[0] + if use_cache is True: + presents = presents + (outputs[1],) + + if output_attentions: + all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],) + + # Add last hidden state + hidden_states = self.ln_f(hidden_states) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None) + + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=presents, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) diff --git a/lmms_eval/models/mplug_owl_video/processing_mplug_owl.py b/lmms_eval/models/mplug_owl_video/processing_mplug_owl.py new file mode 100644 index 00000000..38cbf023 --- /dev/null +++ b/lmms_eval/models/mplug_owl_video/processing_mplug_owl.py @@ -0,0 +1,262 @@ +import re +import torch +import torch.utils.checkpoint + +from transformers.processing_utils import ProcessorMixin +from transformers.tokenization_utils_base import BatchEncoding +from transformers.models.clip.image_processing_clip import CLIPImageProcessor +from .tokenization_mplug_owl import MplugOwlTokenizer + +from decord import VideoReader +import numpy as np +from PIL import Image +from lmms_eval.models.model_utils.load_video import read_video_pyav + + +def get_index(num_frames, num_segments): + seg_size = float(num_frames - 1) / num_segments + start = int(seg_size / 2) + offsets = np.array([start + int(np.round(seg_size * idx)) for idx in range(num_segments)]) + return offsets + + +def load_video(path, num_frames=4): + """vr = VideoReader(path, height=224, width=224) + total_frames = len(vr) + frame_indices = get_index(total_frames, num_frames) + images_group = list() + for frame_index in frame_indices: + img = Image.fromarray(vr[frame_index].asnumpy()).convert("RGB") + images_group.append(img) + return 
images_group""" + # Change a bit here from the original code + # I use pyav instead of decord because it is much more safer + # The operations here are the same, we load video and return a list of PIL Image + # Load video frames + video_frames = read_video_pyav(path, num_frm=num_frames) + target_h, target_w = 224, 224 + # If image shape is not as target, resize it + if video_frames.shape[-3] != target_h or video_frames.shape[-2] != target_w: + video_frames = torch.from_numpy(video_frames).permute(0, 3, 1, 2).float() + video_frames = torch.nn.functional.interpolate(video_frames, size=(target_h, target_w)) + video_frames = video_frames.permute(0, 2, 3, 1).to(torch.uint8).numpy() + video_frames = [Image.fromarray(frame) for frame in video_frames] + if len(video_frames) > num_frames: + video_frames = video_frames[:num_frames] + return video_frames + + +class MplugOwlProcessor(ProcessorMixin): + attributes = [] + tokenizer_class = "MplugOwlTokenizer" + + def __init__(self, image_processor=None, tokenizer=None, **kwargs): + super().__init__(**kwargs) + self.tokens_to_generate = 0 + self.image_processor = image_processor + self.tokenizer = tokenizer + self.add_BOS = True + + def __call__(self, text=None, images=None, videos=None, num_frames=4, return_tensors=None, **kwargs): + if text is None and images is None: + raise ValueError("You have to specify either text or images. Both cannot be none.") + + if text is not None: + encoding = tokenize_prompts( + prompts=text, + tokens_to_generate=self.tokens_to_generate, + add_BOS=self.add_BOS, + tokenizer=self.tokenizer, + ignore_dist=True, + **kwargs, + ) + # encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs) + + if images is not None: + image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs) + + if videos is not None: + video_features = [] + for video in videos: + video_frames = load_video(video, num_frames) + video_feature = self.image_processor(video_frames, return_tensors=return_tensors, **kwargs)["pixel_values"] + video_features.append(video_feature) + video_features = torch.stack(video_features, dim=0) + video_features = video_features.permute(0, 2, 1, 3, 4) + + if text is not None and images is not None: + encoding["pixel_values"] = image_features.pixel_values + return encoding + if text is not None and videos is not None: + encoding["video_pixel_values"] = video_features + return encoding + elif text is not None: + return encoding + elif images is not None: + return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors) + else: + return BatchEncoding(data=dict(video_pixel_values=video_pixel_values), tensor_type=return_tensors) + + def batch_decode(self, skip_special_tokens=True, *args, **kwargs): + """ + This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please + refer to the docstring of this method for more information. + """ + return self.tokenizer.batch_decode(*args, skip_special_tokens=skip_special_tokens, **kwargs) + + def decode(self, skip_special_tokens=True, *args, **kwargs): + """ + This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to + the docstring of this method for more information. 
+ """ + return self.tokenizer.decode(*args, skip_special_tokens=skip_special_tokens, **kwargs) + + +class MplugOwlImageProcessor(CLIPImageProcessor): + pass + + +def detokenize_generations(tokens_gpu_tensor, lengths_gpu_tensor, return_segments, tokenizer): + """Detokenize the generated tokens.""" + + prompts_plus_generations = [] + if return_segments: + prompts_plus_generations_segments = [] + + tokens = tokens_gpu_tensor.cpu().numpy().tolist() + lengths = lengths_gpu_tensor.cpu().numpy().tolist() + for sequence_tokens, length in zip(tokens, lengths): + sequence_tokens = sequence_tokens[:length] + prompts_plus_generations.append(tokenizer.detokenize(sequence_tokens)) + if return_segments: + from tokenizers.decoders import Metaspace + + if hasattr(tokenizer, "tokenizer"): + if isinstance(tokenizer.tokenizer.decoder, Metaspace): + words = tokenizer.tokenizer.decode(sequence_tokens) + else: + words = [] + for token in sequence_tokens: + word = tokenizer.tokenizer.decoder[token] + word = bytearray([tokenizer.tokenizer.byte_decoder[c] for c in word]).decode("utf-8", errors="replace") + words.append(word) + prompts_plus_generations_segments.append(words) + else: + words = tokenizer.detokenize(sequence_tokens) + # else: + # words = [] + # for token in sequence_tokens: + # word = tokenizer.tokenizer.decoder[token] + # word = bytearray( + # [tokenizer.tokenizer.byte_decoder[c] for c in word]).decode( + # 'utf-8', errors='replace') + # words.append(word) + prompts_plus_generations_segments.append(words) + + if return_segments: + return tokens, prompts_plus_generations, prompts_plus_generations_segments + + return tokens, prompts_plus_generations + + +def tokenize_prompts(prompts=None, tokens_to_generate=None, add_BOS=None, rank=0, tokenizer=None, ignore_dist=False, **kwargs): + """Tokenize prompts and make them avaiable on all ranks.""" + + # On all ranks set to None so we can pass them to functions + prompts_tokens_cuda_long_tensor = None + prompts_length_cuda_long_tensor = None + + # On the specified rank, build the above. + attention_mask = None + if ignore_dist or torch.distributed.get_rank() == rank: + assert prompts is not None + assert tokens_to_generate is not None + # Tensor of tokens padded and their unpadded length. + prompts_tokens_cuda_long_tensor, prompts_length_cuda_long_tensor, attention_mask = _tokenize_prompts_and_batch(prompts, tokens_to_generate, add_BOS, tokenizer, **kwargs) + # We need the sizes of these tensors for the boradcast + [ + prompts_tokens_cuda_long_tensor.size(0), # Batch size + prompts_tokens_cuda_long_tensor.size(1), + ] # Sequence lenght + + return { + "input_ids": prompts_tokens_cuda_long_tensor, + "attention_mask": attention_mask, + # "prompt_length": prompts_length_cuda_long_tensor, + } + + +def _tokenize_prompts_and_batch(prompts, tokens_to_generate, add_BOS, tokenizer, **kwargs): + """Given a set of prompts and number of tokens to generate: + - tokenize prompts + - set the sequence length to be the max of length of prompts + plus the number of tokens we would like to generate + - pad all the sequences to this length so we can convert them + into a 2D tensor. + """ + + # Tokenize all the prompts. 
+    # if add_BOS:
+    #     prompts_tokens = [[tokenizer.bos] + tokenizer.tokenize(prompt)
+    #                       for prompt in prompts]
+    # else:
+    #     prompts_tokens = [tokenizer.tokenize(prompt) for prompt in prompts]
+
+    prompts_tokens = [_tokenize_prompt(prompt, tokenizer, add_BOS, **kwargs) for prompt in prompts]
+
+    # Now we have a list of lists of tokens where each list has a different
+    # size. We want to extend this list to:
+    #   - incorporate the tokens that need to be generated
+    #   - make all the sequences equal length.
+    # Get the prompts length.
+    prompts_length = [len(prompt_tokens) for prompt_tokens in prompts_tokens]
+    # Get the max prompts length.
+    max_prompt_len = max(prompts_length)
+    # Number of tokens in each sample of the batch.
+    samples_length = max_prompt_len + tokens_to_generate
+    # Now update the list of lists to be of the same size: samples_length.
+    for prompt_tokens, prompt_length in zip(prompts_tokens, prompts_length):
+        padding_size = samples_length - prompt_length
+        prompt_tokens.extend([tokenizer.eos_token_id] * padding_size)
+
+    # Now that we are in a structured format, we can convert to tensors.
+    prompts_tokens_tensor = torch.LongTensor(prompts_tokens)
+    prompts_length_tensor = torch.LongTensor(prompts_length)
+    attention_mask = torch.zeros(prompts_tokens_tensor.shape[:2])
+    for i, l in enumerate(prompts_length_tensor):
+        attention_mask[i, :l] = 1
+    return prompts_tokens_tensor, prompts_length_tensor, attention_mask
+
+
+def _tokenize_prompt(prompt, tokenizer, add_BOS=False, media_info={"<image>": 65, "<|video|>": 65}, **kwargs):
+    media_tokens = {k: -int(i + 1) for i, k in enumerate(media_info.keys())}
+    media_lengths = media_info.copy()
+
+    if add_BOS:
+        prompt_chunk = [tokenizer.bos_token_id]
+    else:
+        prompt_chunk = []
+
+    # Pure Text
+    if all([media_token not in prompt for media_token in media_tokens.keys()]):
+        enc_chunk = prompt_chunk + tokenizer(prompt, add_special_tokens=False, **kwargs)["input_ids"]
+
+    # Multi-Modal Text
+    else:
+        enc_chunk = prompt_chunk
+        pattern = "|".join(map(re.escape, list(media_tokens.keys())))
+        chunk_strs = re.split(f"({pattern})", prompt)
+        chunk_strs = [x for x in chunk_strs if len(x) > 0]
+        for idx, chunk_str in enumerate(chunk_strs):
+            if chunk_str in media_tokens:
+                enc_chunk += [media_tokens[chunk_str]] * media_lengths[chunk_str]
+            else:
+                tmp_chunk = tokenizer(chunk_str, add_special_tokens=False)["input_ids"]
+                # if idx < len(chunk_strs) - 1:  # Last chunk should not have eos
+                #     tmp_chunk += [tokenizer.eod_id]
+                enc_chunk += tmp_chunk
+    return enc_chunk
+
+
+if __name__ == "__main__":
+    pass
diff --git a/lmms_eval/models/mplug_owl_video/tokenization_mplug_owl.py b/lmms_eval/models/mplug_owl_video/tokenization_mplug_owl.py
new file mode 100644
index 00000000..22384b44
--- /dev/null
+++ b/lmms_eval/models/mplug_owl_video/tokenization_mplug_owl.py
@@ -0,0 +1,62 @@
+# coding=utf-8
+# Copyright 2022 x-plug and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes for MplugOwl.""" + +from transformers.utils import logging +from transformers.models.llama.tokenization_llama import LlamaTokenizer + + +logger = logging.get_logger(__name__) + +VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "MAGAer13/mplug-owl-llama-7b": "https://huggingface.co/MAGAer13/mplug-owl-llama-7b/resolve/main/vocab.txt", + }, +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "MAGAer13/mplug-owl-llama-7b": 2048, +} + + +class MplugOwlTokenizer(LlamaTokenizer): + def __init__( + self, + vocab_file, + unk_token="", + bos_token="", + eos_token="", + pad_token="", + sp_model_kwargs=None, + add_bos_token=False, + add_eos_token=False, + clean_up_tokenization_spaces=False, + **kwargs, + ): + super().__init__( + vocab_file, + unk_token, + bos_token, + eos_token, + pad_token, + sp_model_kwargs, + add_bos_token, + add_eos_token, + clean_up_tokenization_spaces, + **kwargs, + ) + self.eod_id = self.eos_token_id diff --git a/lmms_eval/models/qwen_vl.py b/lmms_eval/models/qwen_vl.py old mode 100644 new mode 100755 index 4d9cdbb1..e55ad7c9 --- a/lmms_eval/models/qwen_vl.py +++ b/lmms_eval/models/qwen_vl.py @@ -242,12 +242,11 @@ def _collate(x): if len(visual_paths) == 0: for context in contexts: query.append({"text": context}) - else: + else: for visual_path, context in zip(visual_paths, contexts): query.append({"image": visual_path}) query.append({"text": context}) - questions = self.tokenizer.from_list_format(query) input_ids = self.tokenizer(questions, return_tensors="pt", padding="longest") diff --git a/lmms_eval/models/reka.py b/lmms_eval/models/reka.py new file mode 100644 index 00000000..d5e85d5d --- /dev/null +++ b/lmms_eval/models/reka.py @@ -0,0 +1,189 @@ +from PIL import Image +from io import BytesIO +from copy import deepcopy +import numpy as np +import os +import base64 +from typing import List, Tuple +from tqdm import tqdm +import requests as url_requests +import time +import logging +import json + +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from accelerate import Accelerator, DistributedType + +NUM_SECONDS_TO_SLEEP = 30 +eval_logger = logging.getLogger("lmms-eval") + +try: + from reka.client import Reka as RekaClient + from reka import ChatMessage + from decord import VideoReader, cpu +except Exception as e: + eval_logger.error(f"Error importing reka: {e}") + + +@register_model("reka") +class Reka(lmms): + def __init__( + self, + model_version: str = "reka-edge", + modality: str = "image", + max_frames_for_video: int = 10, + timeout: int = 120, + continual_mode: bool = False, + response_persistent_folder: str = None, # We will cache the Gemini API response in this path and use it for future requests + **kwargs, + ) -> None: + super().__init__() + self.model_version = model_version + self.modality = modality + self.max_frames_for_video = max_frames_for_video + self.timeout = timeout + self.continual_mode = continual_mode + if self.continual_mode and response_persistent_folder is None: + raise ValueError("Continual mode requires a persistent path for the response. 
Please provide a valid path.") + self.response_persistent_folder = response_persistent_folder + self.response_persistent_file = os.path.join(self.response_persistent_folder, f"{self.model_version}_response.json") + + if os.path.exists(self.response_persistent_file): + with open(self.response_persistent_file, "r") as f: + self.response_cache = json.load(f) + self.cache_mode = "resume" + else: + self.response_cache = {} + self.cache_mode = "start" + + self.reka = RekaClient(api_key=os.getenv("REKA_API_KEY", "YOUR_API_KEY")) + + accelerator = Accelerator() + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device + + def encode_image(self, image): + if type(image) == list: + media_urls = [] + for img in image: + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + media_urls.append(f"data:image/jpeg;base64,{base64_str}") + return media_urls + else: + output_buffer = BytesIO() + image.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + + return f"data:image/jpeg;base64,{base64_str}" + + def encode_video(self, video_path): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, self.max_frames_for_video, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + frames = vr.get_batch(frame_idx).asnumpy() + + base64_frames = [] + for frame in frames: + img = Image.fromarray(frame) + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + base64_frames.append(f"data:image/jpeg;base64,{base64_str}") + + return base64_frames + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for context, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + if self.continual_mode is True and self.cache_mode == "resume": + doc_uuid = f"{task}___{split}___{doc_id}" + if doc_uuid in self.response_cache: + response_text = self.response_cache[doc_uuid] + if response_text: + res.append(response_text) + pbar.update(1) + continue + + visual = doc_to_visual(self.task_dict[task][split][doc_id]) + + message_content = [] + + if self.modality == "image": + media_urls = self.encode_image(visual) + message_content.append({"type": "text", "text": context}) + for media_url in media_urls: + message_content.append({"type": "image_url", "image_url": media_url}) + elif self.modality == "video": + message_content.append({"type": "text", "text": context}) + assert len(visual) == 1, "Reka only supports one video per request" + media_urls = self.encode_video(visual[0]) + assert len(media_urls) == 
self.max_frames_for_video, f"Reka only supports {self.max_frames_for_video} frames per request" + for media_url in media_urls: + message_content.append({"type": "image_url", "image_url": media_url}) + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + if "top_p" not in gen_kwargs: + gen_kwargs["top_p"] = None + if "num_beams" not in gen_kwargs: + gen_kwargs["num_beams"] = 1 + + for attempt in range(5): + try: + response = self.reka.chat.create( + messages=[ + ChatMessage( + role="user", + content=message_content, + ) + ], + model=self.model_version, + ) + response_text = response.responses[0].message.content.strip() + break # If successful, break out of the loop + + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {str(e)}") + if attempt < 5 - 1: # If we have retries left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log and return empty + eval_logger.error(f"All 5 attempts failed. Last error message: {str(e)}") + response_text = "" + + res.append(response_text) + pbar.update(1) + if self.continual_mode is True: # Cache the response + doc_uuid = f"{task}___{split}___{doc_id}" + self.response_cache[doc_uuid] = response_text + with open(self.response_persistent_file, "w") as f: + json.dump(self.response_cache, f) + + pbar.close() + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + # TODO + assert False, "Reka not support loglikelihood" diff --git a/lmms_eval/models/video_chatgpt.py b/lmms_eval/models/video_chatgpt.py new file mode 100644 index 00000000..a724cd98 --- /dev/null +++ b/lmms_eval/models/video_chatgpt.py @@ -0,0 +1,200 @@ +import os +from lmms_eval import utils +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model + +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from huggingface_hub import snapshot_download +import torch +from PIL import Image + +from datetime import timedelta +import logging +from typing import List, Tuple, Optional, Union +from tqdm import tqdm + +try: + from lmms_eval.models.video_chatgpt.eval.model_utils import load_video, initialize_model + from lmms_eval.models.video_chatgpt.inference import video_chatgpt_infer, video_chatgpt_infer_ppl, get_spatio_temporal_features_torch +except ImportError: + eval_logger = logging.getLogger("lmms-eval") + eval_logger.info("Failed to import video_chatgpt modules") + +from lmms_eval.models.model_utils.load_video import read_video_pyav + +eval_logger = logging.getLogger("lmms-eval") + + +@register_model("video_chatgpt") +class VideoChatGPT(lmms): + def __init__( + self, + batch_size: Optional[Union[int, str]] = 1, + projection_path: str = "MBZUAI/Video-ChatGPT-7B", + model_path: str = "mmaaz60/LLaVA-7B-Lightening-v1-1", + device_map="cuda:0", + device: Optional[str] = "cuda:0", + num_frm: Optional[Union[int, str]] = 100, + ) -> None: + super().__init__() + self.batch_size_per_gpu = int(batch_size) + self.num_frm = int(num_frm) + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = 
f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + try: + self.model, self.vision_tower, self.tokenizer, self.image_processor, self.video_token_len = initialize_model(model_path, projection_path, device=self.device) + except: + eval_logger.info("Does not find the model from the path you provide, try downloading from the hf repo.") + model_path = snapshot_download(repo_id=model_path) + projection_path = os.path.join(snapshot_download(repo_id=projection_path), "video_chatgpt-7B.bin") + self.model, self.vision_tower, self.tokenizer, self.image_processor, self.video_token_len = initialize_model(model_path, projection_path, device=self.device) + + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. + if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. 
Make sure you run `accelerate config` and set zero stage to 0") + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + elif accelerator.num_processes == 1 and device_map == "auto": + eval_logger.info(f"Using {accelerator.num_processes} devices with tensor parallelism") + self._rank = 0 + self._word_size = 1 + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + # videos = [] + for visual in visuals: + video_frames = read_video_pyav(visual, num_frm=self.num_frm) + target_h, target_w = 224, 224 + # If image shape is not as target, resize it + if video_frames.shape[-3] != target_h or video_frames.shape[-2] != target_w: + video_frames = torch.from_numpy(video_frames).permute(0, 3, 1, 2).float() + video_frames = torch.nn.functional.interpolate(video_frames, size=(target_h, target_w)) + video_frames = video_frames.permute(0, 2, 3, 1).to(torch.uint8).numpy() + video_frames = [Image.fromarray(frame) for frame in video_frames] + if len(video_frames) > self.num_frm: + video_frames = video_frames[: self.num_frm] + # VideoChatGPT load video return a list of PIL Image + # videos += video_frames + + output = video_chatgpt_infer( + video_frames, contexts, conv_mode="video-chatgpt_v1", model=self.model, vision_tower=self.vision_tower, tokenizer=self.tokenizer, image_processor=self.image_processor, video_token_len=self.video_token_len + ) + + res.append(output) + pbar.update(1) + + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, doc_to_target, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + if type(doc_to_target) == str: + continuation = doc_to_target + else: + continuation = doc_to_target(self.task_dict[task][split][doc_id]) + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + videos = [] + for visual in visuals: + video_frames = load_video(visual, num_frm=self.num_frm) + # VideoChatGPT load video return a list of PIL Image + videos += video_frames + image_tensor = self.image_processor.preprocess(videos, return_tensors="pt")["pixel_values"] + + # Move image tensor to GPU and reduce precision to half + image_tensor = image_tensor.half().to(self.device) + + # Generate video spatio-temporal features + with torch.no_grad(): + image_forward_outs = 
self.vision_tower(image_tensor, output_hidden_states=True) + frame_features = image_forward_outs.hidden_states[-2][:, 1:] # Use second to last layer as in LLaVA + video_spatio_temporal_features = get_spatio_temporal_features_torch(frame_features).cuda() + + outputs, input_ids, context_ids = video_chatgpt_infer_ppl( + # video_frames, + contexts, + continuation, + conv_mode="video-chatgpt_v1", + model=self.model, + vision_tower=self.vision_tower, + tokenizer=self.tokenizer, + image_processor=self.image_processor, + video_token_len=self.video_token_len, + video_spatio_temporal_features=video_spatio_temporal_features, + ) + + loss = outputs["loss"] + # loss = torch.exp(loss) + logits = outputs["logits"] + greedy_tokens = logits.argmax(dim=-1) + cont_toks = input_ids[:, context_ids.shape[1] :] # [1, seq] + greedy_tokens = greedy_tokens[:, context_ids.shape[1] : input_ids.shape[1]] # [1, seq] + max_equal = (greedy_tokens == cont_toks).all() + res.append((float(loss.item()), bool(max_equal))) + pbar.update(1) + pbar.close() + return res + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size diff --git a/lmms_eval/models/video_chatgpt/__init__.py b/lmms_eval/models/video_chatgpt/__init__.py new file mode 100644 index 00000000..c5f48379 --- /dev/null +++ b/lmms_eval/models/video_chatgpt/__init__.py @@ -0,0 +1 @@ +from .model import VideoChatGPTLlamaForCausalLM diff --git a/lmms_eval/models/video_chatgpt/constants.py b/lmms_eval/models/video_chatgpt/constants.py new file mode 100644 index 00000000..c9ea9ac1 --- /dev/null +++ b/lmms_eval/models/video_chatgpt/constants.py @@ -0,0 +1,11 @@ +CONTROLLER_HEART_BEAT_EXPIRATION = 30 +WORKER_HEART_BEAT_INTERVAL = 15 + +LOGDIR = "." + + +# Defining model +DEFAULT_VIDEO_TOKEN = "