diff --git a/.github/issue_template.md b/.github/issue_template.md old mode 100644 new mode 100755 diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md old mode 100644 new mode 100755 diff --git a/.github/workflows/black.yml b/.github/workflows/black.yml old mode 100644 new mode 100755 diff --git a/.gitignore b/.gitignore old mode 100644 new mode 100755 index a2e6a0ba..2557ab1b --- a/.gitignore +++ b/.gitignore @@ -29,3 +29,11 @@ ckpt pretrained/ LLaVA/ *logs +temp/ +InternVL/ +logs/ +data/ +llava-video/ +Video-MME/ +VATEX/ +lmms_eval/tasks/vatex/__pycache__/utils.cpython-310.pyc diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml old mode 100644 new mode 100755 diff --git a/README.md b/README.md old mode 100644 new mode 100755 index 04b62aef..72b15fb1 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -

+

@@ -6,79 +6,31 @@ > Accelerating the development of large multimodal models (LMMs) with `lmms-eval` -🏠 [Homepage](https://lmms-lab.github.io/) | 🎉 [Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab) | Discord_Thread [discord/lmms-eval](https://discord.gg/zdkwKUqrPy) - - -In today's world, we're on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks. - -To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI. However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we're on a treasure hunt, but the maps are scattered everywhere. - -In the field of language models, there has been a valuable precedent set by the work of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). They offer integrated data and model interfaces, enabling rapid evaluation of language models and serving as the backend support framework for the [open-llm-leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), and has gradually become the underlying ecosystem of the era of foundation models. - -However, though there are many new evaluation datasets are recently proposed, the efficient evaluation pipeline of LMM is still in its infancy, and there is no unified evaluation framework that can be used to evaluate LMM across a wide range of datasets. To address this challenge, we introduce **lmms-eval**, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMM. - -We humbly obsorbed the exquisite and efficient design of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). Building upon its foundation, we implemented our `lmms-eval` framework with performance optimizations specifically for LMMs. - -## Necessity of lmms-eval +🏠 [LMMs-Lab Homepage](https://lmms-lab.github.io/) | 🎉 [Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab) | Discord_Thread [discord/lmms-eval](https://discord.gg/zdkwKUqrPy) -We believe our effort could provide an efficient interface for the detailed comparison of publicly available models to discern their strengths and weaknesses. It's also useful for research institutions and production-oriented companies to accelerate the development of large multimodal models. With the `lmms-eval`, we have significantly accelerated the lifecycle of model iteration. Inside the LLaVA team, the utilization of `lmms-eval` largely improves the efficiency of the model development cycle, as we are able to evaluate weekly trained hundreds of checkpoints on 20-30 datasets, identifying the strengths and weaknesses, and then make targeted improvements. 
+---

# Announcement

-## Contribution Guidance

+- [2024-06] 🎬🎬 `lmms-eval/v0.2` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details.

-We've added guidance on contributing new datasets and models. Please refer to our [documentation](docs/README.md). If you need assistance, you can contact us via [discord/lmms-eval](https://discord.gg/ebAMGSsS).

+- [2024-03] 📝📝 We have released the first version of `lmms-eval`. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details.

-## v0.1.0 Released

+# Why `lmms-eval`?

-The first version of the `lmms-eval` is released. We are working on providing an one-command evaluation suite for accelerating the development of LMMs.
-
-> In [LLaVA Next](https://llava-vl.github.io/blog/2024-01-30-llava-next/) development, we internally utilize this suite to evaluate the multiple different model versions on various datasets. It significantly accelerates the model development cycle for it's easy integration and fast evaluation speed.
-
-The main feature includes:
-
-

- +

+

-### One-command evaluation, with detailed logs and samples. -You can evaluate the models on multiple datasets with a single command. No model/data preparation is needed, just one command line, few minutes, and get the results. Not just a result number, but also the detailed logs and samples, including the model args, input question, model response, and ground truth answer. - -```python -# Evaluating LLaVA on multiple datasets -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ # -``` - -### Accelerator support and Tasks grouping. -We support the usage of `accelerate` to wrap the model for distributed evaluation, supporting multi-gpu and tensor parallelism. With **Task Grouping**, all instances from all tasks are grouped and evaluated in parallel, which significantly improves the throughput of the evaluation. After evaluation, all instances are sent to postprocessing module for metric calcuations and potential GPT4-eval queries. - -Below are the total runtime on different datasets using 4 x A100 40G. - -| Dataset (#num) | LLaVA-v1.5-7b | LLaVA-v1.5-13b | -| :---------------------- | :----------------- | :----------------- | -| mme (2374) | 2 mins 43 seconds | 3 mins 27 seconds | -| gqa (12578) | 10 mins 43 seconds | 14 mins 23 seconds | -| scienceqa_img (2017) | 1 mins 58 seconds | 2 mins 52 seconds | -| ai2d (3088) | 3 mins 17 seconds | 4 mins 12 seconds | -| coco2017_cap_val (5000) | 14 mins 13 seconds | 19 mins 58 seconds | - -### All-In-One HF dataset hubs. - -We are hosting more than 40 (and increasing) datasets on [huggingface/lmms-lab](https://huggingface.co/lmms-lab), we carefully converted these datasets from original sources and included all variants, versions and splits. Now they can be directly accessed without any burden of data preprocessing. They also serve for the purpose of visualizing the data and grasping the sense of evaluation tasks distribution. - -

- -

+In today's world, we're on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks. -### Detailed Logging Utilites +To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI. -We provide detailed logging utilities to help you understand the evaluation process and results. The logs include the model args, generation parameters, input question, model response, and ground truth answer. You can also record every details and visualize them inside runs on Weights & Biases. +However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we're on a treasure hunt, but the maps are scattered everywhere. -{% include figure.liquid loading="eager" path="assets/img/wandb_table.png" class="img-fluid rounded z-depth-1" zoomable=true %} +In the field of language models, there has been a valuable precedent set by the work of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). They offer integrated data and model interfaces, enabling rapid evaluation of language models and serving as the backend support framework for the [open-llm-leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), and has gradually become the underlying ecosystem of the era of foundation models. -

- -

+We humbly absorbed the exquisite and efficient design of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and introduce **lmms-eval**, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.

# Installation

@@ -95,37 +47,35 @@ pip install -e .
```

If you want to test LLaVA, you will have to clone their repo from [LLaVA](https://github.com/haotian-liu/LLaVA) and
-```
-git clone https://github.com/haotian-liu/LLaVA
-cd LLaVA
+```bash
+# for llava 1.5
+# git clone https://github.com/haotian-liu/LLaVA
+# cd LLaVA
+# pip install -e .
+
+# for llava-next (1.6)
+git clone https://github.com/LLaVA-VL/LLaVA-NeXT
+cd LLaVA-NeXT
pip install -e .
```

+
+Reproduction of LLaVA-1.5's paper results + You can check the [environment install script](miscs/repr_scripts.sh) and [torch environment info](miscs/repr_torch_envs.txt) to **reproduce LLaVA-1.5's paper results**. We found torch/cuda versions difference would cause small variations in the results, we provide the [results check](miscs/llava_result_check.md) with different environments. +
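Because small score deviations usually come down to the local torch/CUDA stack, it can help to record your own versions and compare them against the reference environment in `miscs/repr_torch_envs.txt` before digging deeper. A minimal, illustrative check (not part of the repository):

```python
# Illustrative environment check: compare the printed versions with
# miscs/repr_torch_envs.txt when reproduced scores differ slightly.
import torch

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
print("gpu:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu only")
```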
+ If you want to test on caption dataset such as `coco`, `refcoco`, and `nocaps`, you will need to have `java==1.8.0 ` to let pycocoeval api to work. If you don't have it, you can install by using conda ``` conda install openjdk=8 ``` you can then check your java version by `java -version` -# Usage -```bash -# Evaluating LLaVA on MME -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme --output_path ./logs/ - -# Evaluating LLaVA on multiple datasets -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ # - -# For other variants llava. Note that `conv_template` is an arg of the init function of llava in `lmms_eval/models/llava.py` -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.6-mistral-7b,conv_template=mistral_instruct" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ # -accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.6-34b,conv_template=mistral_direct" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ # -# From a predefined configuration, supporting evaluation of multiple models and datasets -accelerate launch --num_processes=8 -m lmms_eval --config example_eval.yaml -``` - -# Model Results +
+Comprehensive Evaluation Results of LLaVA Family Models +
As demonstrated by the extensive table below, we aim to provide detailed information for readers to understand the datasets included in lmms-eval and some specific details about these datasets (we remain grateful for any corrections readers may have during our evaluation process). @@ -137,162 +87,117 @@ We provide a Google Sheet for the detailed results of the LLaVA series models on We also provide the raw data exported from Weights & Biases for the detailed results of the LLaVA series models on different datasets. You can access the raw data [here](https://docs.google.com/spreadsheets/d/1AvaEmuG4csSmXaHjgu4ei1KBMmNNW8wflOD_kkTDdv8/edit?usp=sharing). -> Development will be continuing on the main branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub. +
+
+
+

Our development will continue on the main branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or to ask questions, either in issues or PRs on GitHub.
+
+# Multiple Usages
+**Evaluation of LLaVA on MME**
+
+```bash
+python3 -m accelerate.commands.launch \
+    --num_processes=8 \
+    -m lmms_eval \
+    --model llava \
+    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
+    --tasks mme \
+    --batch_size 1 \
+    --log_samples \
+    --log_samples_suffix llava_v1.5_mme \
+    --output_path ./logs/
+```
+
+**Evaluation of LLaVA on multiple datasets**
+
+```bash
+python3 -m accelerate.commands.launch \
+    --num_processes=8 \
+    -m lmms_eval \
+    --model llava \
+    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
+    --tasks mme,mmbench_en \
+    --batch_size 1 \
+    --log_samples \
+    --log_samples_suffix llava_v1.5_mme_mmbenchen \
+    --output_path ./logs/
+```
+
+**For other LLaVA variants, please change the `conv_template` in the `model_args`**
+
+> `conv_template` is an arg of the init function of llava in `lmms_eval/models/llava.py`; you can find the corresponding value in LLaVA's code, likely in the `conv_templates` dict in `llava/conversations.py`.
+
+```bash
+python3 -m accelerate.commands.launch \
+    --num_processes=8 \
+    -m lmms_eval \
+    --model llava \
+    --model_args pretrained="liuhaotian/llava-v1.6-mistral-7b,conv_template=mistral_instruct" \
+    --tasks mme,mmbench_en \
+    --batch_size 1 \
+    --log_samples \
+    --log_samples_suffix llava_v1.5_mme_mmbenchen \
+    --output_path ./logs/
+```
+
+**Evaluation of larger LMMs (llava-v1.6-34b)**
+
+```bash
+python3 -m accelerate.commands.launch \
+    --num_processes=8 \
+    -m lmms_eval \
+    --model llava \
+    --model_args pretrained="liuhaotian/llava-v1.6-34b,conv_template=mistral_direct" \
+    --tasks mme,mmbench_en \
+    --batch_size 1 \
+    --log_samples \
+    --log_samples_suffix llava_v1.5_mme_mmbenchen \
+    --output_path ./logs/
+```
+
+**Evaluation with a set of configurations, supporting evaluation of multiple models and datasets**
+
+```bash
+python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --config ./miscs/example_eval.yaml
+```
+
+**Evaluation with naive model sharding for bigger models (llava-next-72b)**
+
+```bash
+python3 -m lmms_eval \
+    --model=llava \
+    --model_args=pretrained=lmms-lab/llava-next-72b,conv_template=qwen_1_5,device_map=auto,model_name=llava_qwen \
+    --tasks=pope,vizwiz_vqa_val,scienceqa_img \
+    --batch_size=1 \
+    --log_samples \
+    --log_samples_suffix=llava_qwen \
+    --output_path="./logs/" \
+    --wandb_args=project=lmms-eval,job_type=eval,entity=llava-vl
+```
+
+**Evaluation with SGLang for bigger models (llava-next-72b)**
+
+```bash
+python3 -m lmms_eval \
+    --model=llava_sglang \
+    --model_args=pretrained=lmms-lab/llava-next-72b,tokenizer=lmms-lab/llavanext-qwen-tokenizer,conv_template=chatml-llava,tp_size=8,parallel=8 \
+    --tasks=mme \
+    --batch_size=1 \
+    --log_samples \
+    --log_samples_suffix=llava_qwen \
+    --output_path=./logs/ \
+    --verbosity=INFO
+```

## Supported models

-- GPT4V (API, only generation-based evaluation)
-- LLaVA-v1.5/v1.6-7B/13B/34B (ppl-based, generation-based)
-- Qwen-VL series (ppl-based, generation-based)
-- Fuyu series (ppl-based, generation-based)
-- InstructBLIP series (generation-based)
-
-## Supported datasets
-> () indicates the task name in the lmms_eval. The task name is also used to specify the dataset in the configuration file.
- -- AI2D (ai2d) -- ChartQA (chartqa) -- CMMMU (cmmmu) - - CMMMU Validation (cmmmu_val) - - CMMMU Test (cmmmu_test) -- COCO Caption (coco_cap) - - COCO 2014 Caption (coco2014_cap) - - COCO 2014 Caption Validation (coco2014_cap_val) - - COCO 2014 Caption Test (coco2014_cap_test) - - COCO 2017 Caption (coco2017_cap) - - COCO 2017 Caption MiniVal (coco2017_cap_val) - - COCO 2017 Caption MiniTest (coco2017_cap_test) -- [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench) -- DOCVQA (docvqa) - - DOCVQA Validation (docvqa_val) - - DOCVQA Test (docvqa_test) -- Ferret (ferret) -- Flickr30K (flickr30k) - - Ferret Test (ferret_test) -- GQA (gqa) -- HallusionBenchmark (hallusion_bench_image) -- Infographic VQA (info_vqa) - - Infographic VQA Validation (info_vqa_val) - - Infographic VQA Test (info_vqa_test) -- LLaVA-Bench (llava_in_the_wild) -- LLaVA-Bench-COCO (llava_bench_coco) -- MathVerse (mathverse) - - MathVerse Text Dominant (mathverse_testmini_text_dominant) - - MathVerse Text Only (mathverse_testmini_text_only) - - MathVerse Text Lite (mathverse_testmini_text_lite) - - MathVerse Vision Dominant (mathverse_testmini_vision_dominant) - - MathVerse Vision Intensive (mathverse_testmini_vision_intensive) - - MathVerse Vision Only (mathverse_testmini_vision_only) -- MathVista (mathvista) - - MathVista Validation (mathvista_testmini) - - MathVista Test (mathvista_test) -- MMBench (mmbench) - - MMBench English (mmbench_en) - - MMBench English Dev (mmbench_en_dev) - - MMBench English Test (mmbench_en_test) - - MMBench Chinese (mmbench_cn) - - MMBench Chinese Dev (mmbench_cn_dev) - - MMBench Chinese Test (mmbench_cn_test) -- MME (mme) -- MMMU (mmmu) - - MMMU Validation (mmmu_val) - - MMMU Test (mmmu_test) -- MMUPD (mmupd) - - MMUPD Base (mmupd_base) - - MMAAD Base (mmaad_base) - - MMIASD Base (mmiasd_base) - - MMIVQD Base (mmivqd_base) - - MMUPD Option (mmupd_option) - - MMAAD Option (mmaad_option) - - MMIASD Option (mmiasd_option) - - MMIVQD Option (mmivqd_option) - - MMUPD Instruction (mmupd_instruction) - - MMAAD Instruction (mmaad_instruction) - - MMIASD Instruction (mmiasd_instruction) - - MMIVQD Instruction (mmivqd_instruction) -- MMVet (mmvet) -- Multi-DocVQA (multidocvqa) - - Multi-DocVQA Validation (multidocvqa_val) - - Multi-DocVQA Test (multidocvqa_test) -- NoCaps (nocaps) - - NoCaps Validation (nocaps_val) - - NoCaps Test (nocaps_test) -- OKVQA (ok_vqa) - - OKVQA Validation 2014 (ok_vqa_val2014) -- POPE (pope) -- RefCOCO (refcoco) - - refcoco_seg - - refcoco_seg_test - - refcoco_seg_val - - refcoco_seg_testA - - refcoco_seg_testB - - refcoco_bbox - - refcoco_bbox_test - - refcoco_bbox_val - - refcoco_bbox_testA - - refcoco_bbox_testB - - refcoco_bbox_rec - - refcoco_bbox_rec_test - - refcoco_bbox_rec_val - - refcoco_bbox_rec_testA - - refcoco_bbox_rec_testB -- RefCOCO+ (refcoco+) - - refcoco+_seg - - refcoco+_seg_val - - refcoco+_seg_testA - - refcoco+_seg_testB - - refcoco+_bbox - - refcoco+_bbox_val - - refcoco+_bbox_testA - - refcoco+_bbox_testB - - refcoco+_bbox_rec - - refcoco+_bbox_rec_val - - refcoco+_bbox_rec_testA - - refcoco+_bbox_rec_testB -- RefCOCOg (refcocog) - - refcocog_seg - - refcocog_seg_test - - refcocog_seg_val - - refcocog_bbox - - refcocog_bbox_test - - refcocog_bbox_val - - refcocog_bbox_rec - - refcocog_bbox_rec_test - - refcocog_bbox_rec_val -- ScienceQA (scienceqa_full) - - ScienceQA Full (scienceqa) - - ScienceQA IMG (scienceqa_img) -- ScreenSpot (screenspot) - - ScreenSpot REC / Grounding (screenspot_rec) - - ScreenSpot REG / Instruction 
Generation (screenspot_reg)
-- SeedBench (seedbench)
-- SeedBench 2 (seedbench_2)
-- ST-VQA (stvqa)
-- TextCaps (textcaps)
-  - TextCaps Validation (textcaps_val)
-  - TextCaps Test (textcaps_test)
-- TextVQA (textvqa)
-  - TextVQA Validation (textvqa_val)
-  - TextVQA Test (textvqa_test)
-- VizWizVQA (vizwiz_vqa)
-  - VizWizVQA Validation (vizwiz_vqa_val)
-  - VizWizVQA Test (vizwiz_vqa_test)
-- VQAv2 (vqav2)
-  - VQAv2 Validation (vqav2_val)
-  - VQAv2 Test (vqav2_test)
-- WebSRC (websrc)
-  - WebSRC Validation (websrc_val)
-  - WebSRC Test (websrc_test)
-
-## Datasets to be added and tested
-- TallyQA (tallyqa)
-- VSR (vsr)
-- Winoground (winoground)
-- NLVR2 (nlvr2)
-- RavenIQ-Test (raveniq)
-- IconQA (iconqa)
-- VistBench (vistbench)
+Please check [supported models](lmms_eval/models/__init__.py) for more details.
+
+## Supported tasks
+
+Please check [supported tasks](docs/current_tasks.md) for more details.

# Add Customized Model and Dataset

@@ -302,14 +207,43 @@ Please refer to our [documentation](docs/README.md).

lmms_eval is a fork of [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). We recommend you read through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) for relevant information.

+---
+
Below are the changes we made to the original API:

- Build context now only passes in the doc `idx`; the image and doc are processed during the model responding phase. This is because datasets now contain lots of images and we can't store them in the doc like the original lm-eval-harness does, otherwise the CPU memory would explode.
- Instance.args (lmms_eval/api/instance.py) now contains a list of images to be input to the lmms (see the short sketch below).
- lm-eval-harness supports all HF language models with a single model class. This is currently not possible for lmms because the input/output formats of lmms in HF are not yet unified. Therefore, we have to create a new class for each lmms model. This is not ideal and we will try to unify them in the future.

-We also thank:
+---
+
+During the initial stage of our project, we thank:
- [Xiang Yue](https://xiangyue9607.github.io/), [Jingkang Yang](https://jingkang50.github.io/), [Dong Guo](https://www.linkedin.com/in/dongguoset/) and [Sheng Shen](https://sincerass.github.io/) for early discussion and testing.
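To make the API changes above concrete, here is a minimal sketch of how a `generate_until` request is assembled, mirroring the argument tuple built by `construct_requests` in `lmms_eval/api/task.py` as shown later in this diff. The prompt text, task/split names, and the `doc_to_visual` helper are illustrative placeholders rather than repository code:

```python
# Illustrative sketch of a generate_until request. The argument tuple follows
# (ctx, generation_kwargs, doc_to_visual, doc_id, task, split), so visuals are
# resolved lazily via doc_to_visual instead of being stored inside the doc.
from lmms_eval.api.instance import Instance

def doc_to_visual(doc):
    # Hypothetical helper: returns the list of images for a given doc.
    return [doc["image"]]

ctx = "What is shown in the image?"                      # prompt built from the doc
gen_kwargs = {"max_new_tokens": 16, "do_sample": False}  # task's generation_kwargs

request = Instance(
    request_type="generate_until",
    arguments=(ctx, gen_kwargs, doc_to_visual, 0, "mme", "test"),
    idx=0,
)  # the evaluator additionally forwards bookkeeping kwargs such as task metadata
```

Models registered under `lmms_eval/models` then call `doc_to_visual(...)` themselves while generating responses, which is why images never have to be serialized into the docs.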
+--- + +During the `v0.1` to `v0.2`, we thank the community support from pull requests (PRs): + +> Details are in [lmms-eval/v0.2.0 release notes](https://github.com/EvolvingLMMs-Lab/lmms-eval/releases/tag/untagged-9057ff0e9a72d5a5846f) + +**Datasets:** + +- VCR: Vision_Caption_Restoration (officially from the authors, MILA) +- ConBench (officially from the authors, PKU/Bytedance) +- MathVerse (officially from the authors, CUHK) +- MM-UPD (officially from the authors, University of Tokyo) +- Multi-lingual MMMU (officially from the authors, CUHK) +- WebSRC (from Hunter Heiden) +- ScreeSpot (from Hunter Heiden) +- RealworldQA (from Fanyi Pu, NTU) +- Multi-lingual LLaVA-W (from Gagan Bhatia, UBC) + +**Models:** + +- LLaVA-HF (officially from Huggingface) +- Idefics-2 (from the lmms-lab team) +- microsoft/Phi-3-Vision (officially from the authors, Microsoft) +- LLaVA-SGlang (from the lams-lab team) + ## Citations ```shell diff --git a/docs/README.md b/docs/README.md old mode 100644 new mode 100755 diff --git a/docs/commands.md b/docs/commands.md old mode 100644 new mode 100755 diff --git a/docs/current_tasks.md b/docs/current_tasks.md new file mode 100644 index 00000000..1622e960 --- /dev/null +++ b/docs/current_tasks.md @@ -0,0 +1,122 @@ +# Current Tasks + +> () indicates the task name in the lmms_eval. The task name is also used to specify the dataset in the configuration file. +> The following is manually updated documentation. You could use `lmms_eval task --list` to list all supported tasks and their task names. + +- AI2D (ai2d) +- ChartQA (chartqa) +- CMMMU (cmmmu) + - CMMMU Validation (cmmmu_val) + - CMMMU Test (cmmmu_test) +- COCO Caption (coco_cap) + - COCO 2014 Caption (coco2014_cap) + - COCO 2014 Caption Validation (coco2014_cap_val) + - COCO 2014 Caption Test (coco2014_cap_test) + - COCO 2017 Caption (coco2017_cap) + - COCO 2017 Caption MiniVal (coco2017_cap_val) + - COCO 2017 Caption MiniTest (coco2017_cap_test) +- [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench) +- DOCVQA (docvqa) + - DOCVQA Validation (docvqa_val) + - DOCVQA Test (docvqa_test) +- Ferret (ferret) +- Flickr30K (flickr30k) + - Ferret Test (ferret_test) +- GQA (gqa) +- HallusionBenchmark (hallusion_bench_image) +- Infographic VQA (info_vqa) + - Infographic VQA Validation (info_vqa_val) + - Infographic VQA Test (info_vqa_test) +- LLaVA-Bench (llava_in_the_wild) +- LLaVA-Bench-COCO (llava_bench_coco) +- MathVerse (mathverse) + - MathVerse Text Dominant (mathverse_testmini_text_dominant) + - MathVerse Text Only (mathverse_testmini_text_only) + - MathVerse Text Lite (mathverse_testmini_text_lite) + - MathVerse Vision Dominant (mathverse_testmini_vision_dominant) + - MathVerse Vision Intensive (mathverse_testmini_vision_intensive) + - MathVerse Vision Only (mathverse_testmini_vision_only) +- MathVista (mathvista) + - MathVista Validation (mathvista_testmini) + - MathVista Test (mathvista_test) +- MMBench (mmbench) + - MMBench English (mmbench_en) + - MMBench English Dev (mmbench_en_dev) + - MMBench English Test (mmbench_en_test) + - MMBench Chinese (mmbench_cn) + - MMBench Chinese Dev (mmbench_cn_dev) + - MMBench Chinese Test (mmbench_cn_test) +- MME (mme) +- MMMU (mmmu) + - MMMU Validation (mmmu_val) + - MMMU Test (mmmu_test) +- MMUPD (mmupd) + - MMUPD Base (mmupd_base) + - MMAAD Base (mmaad_base) + - MMIASD Base (mmiasd_base) + - MMIVQD Base (mmivqd_base) + - MMUPD Option (mmupd_option) + - MMAAD Option (mmaad_option) + - MMIASD Option (mmiasd_option) + - MMIVQD Option (mmivqd_option) + - 
MMUPD Instruction (mmupd_instruction) + - MMAAD Instruction (mmaad_instruction) + - MMIASD Instruction (mmiasd_instruction) + - MMIVQD Instruction (mmivqd_instruction) +- MMVet (mmvet) +- Multi-DocVQA (multidocvqa) + - Multi-DocVQA Validation (multidocvqa_val) + - Multi-DocVQA Test (multidocvqa_test) +- NoCaps (nocaps) + - NoCaps Validation (nocaps_val) + - NoCaps Test (nocaps_test) +- OKVQA (ok_vqa) + - OKVQA Validation 2014 (ok_vqa_val2014) +- POPE (pope) +- RefCOCO (refcoco) + - refcoco_seg_test + - refcoco_seg_val + - refcoco_seg_testA + - refcoco_seg_testB + - refcoco_bbox_test + - refcoco_bbox_val + - refcoco_bbox_testA + - refcoco_bbox_testB +- RefCOCO+ (refcoco+) + - refcoco+_seg + - refcoco+_seg_val + - refcoco+_seg_testA + - refcoco+_seg_testB + - refcoco+_bbox + - refcoco+_bbox_val + - refcoco+_bbox_testA + - refcoco+_bbox_testB +- RefCOCOg (refcocog) + - refcocog_seg_test + - refcocog_seg_val + - refcocog_bbox_test + - refcocog_bbox_val +- ScienceQA (scienceqa_full) + - ScienceQA Full (scienceqa) + - ScienceQA IMG (scienceqa_img) +- ScreenSpot (screenspot) + - ScreenSpot REC / Grounding (screenspot_rec) + - ScreenSpot REG / Instruction Generation (screenspot_reg) +- SeedBench (seedbench) +- SeedBench 2 (seedbench_2) +- ST-VQA (stvqa) +- TextCaps (textcaps) + - TextCaps Validation (textcaps_val) + - TextCaps Test (textcaps_test) +- TextVQA (textvqa) + - TextVQA Validation (textvqa_val) + - TextVQA Test (textvqa_test) +- VizWizVQA (vizwiz_vqa) + - VizWizVQA Validation (vizwiz_vqa_val) + - VizWizVQA Test (vizwiz_vqa_test) +- VQAv2 (vqav2) + - VQAv2 Validation (vqav2_val) + - VQAv2 Test (vqav2_test) +- WebSRC (websrc) + - WebSRC Validation (websrc_val) + - WebSRC Test (websrc_test) \ No newline at end of file diff --git a/docs/model_guide.md b/docs/model_guide.md old mode 100644 new mode 100755 diff --git a/docs/task_guide.md b/docs/task_guide.md old mode 100644 new mode 100755 index 31fb443d..1376bc22 --- a/docs/task_guide.md +++ b/docs/task_guide.md @@ -27,7 +27,7 @@ doc_to_target: "answer" generation_kwargs: max_new_tokens: 16 temperature: 0 - top_p: 0 + top_p: 1.0 num_beams: 1 do_sample: false # The return value of process_results will be used by metrics diff --git a/example_eval.yaml b/example_eval.yaml deleted file mode 100644 index 40e29a85..00000000 --- a/example_eval.yaml +++ /dev/null @@ -1,15 +0,0 @@ -- model: llava - model_args: pretrained=liuhaotian/llava-v1.5-7b - tasks: ai2d - batch_size: 1 - log_samples: true - log_samples_suffix: eval_vizwiz_vqa - output_path: "./logs/" - -- model: llava - model_args: pretrained=liuhaotian/llava-v1.5-13b - tasks: mme - batch_size: 1 - log_samples: true - log_samples_suffix: mme - output_path: "./logs/" diff --git a/lmms_eval/__init__.py b/lmms_eval/__init__.py old mode 100644 new mode 100755 diff --git a/lmms_eval/__main__.py b/lmms_eval/__main__.py old mode 100644 new mode 100755 index c852d2f4..2949705f --- a/lmms_eval/__main__.py +++ b/lmms_eval/__main__.py @@ -106,9 +106,16 @@ def parse_eval_args() -> argparse.Namespace: parser.add_argument( "--log_samples_suffix", type=str, - default="", + default="model_outputs", help="Specify a suffix for the log_samples file name.", ) + parser.add_argument( + "--predict_only", + "-x", + action="store_true", + default=False, + help="Use with --log_samples. 
Only model outputs will be saved and metrics will not be evaluated.", + ) parser.add_argument( "--show_config", action="store_true", @@ -228,6 +235,10 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: initialize_tasks(args.verbosity) + if args.predict_only: + args.log_samples = True + if (args.log_samples or args.predict_only) and not args.output_path: + raise ValueError("Specify --output_path if providing --log_samples or --predict_only") if args.limit: eval_logger.warning(" --limit SHOULD ONLY BE USED FOR TESTING." "REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.") if args.include_path is not None: @@ -274,6 +285,10 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: # set datetime before evaluation datetime_str = utils.get_datetime_str(timezone=args.timezone) if args.output_path: + if args.log_samples_suffix and len(args.log_samples_suffix) > 15: + eval_logger.warning("The suffix for log_samples is too long. It is recommended to keep it under 15 characters.") + args.log_samples_suffix = args.log_samples_suffix[:5] + "..." + args.log_samples_suffix[-5:] + hash_input = f"{args.model_args}".encode("utf-8") hash_output = hashlib.sha256(hash_input).hexdigest()[:6] path = Path(args.output_path) @@ -296,6 +311,7 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: log_samples=args.log_samples, gen_kwargs=args.gen_kwargs, cli_args=args, + predict_only=args.predict_only, ) if results is not None: @@ -318,9 +334,9 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: for task_name, config in results["configs"].items(): filename = args.output_path.joinpath(f"{task_name}.json") # Structure the data with 'args' and 'logs' keys - data_to_dump = {"args": vars(args), "model_configs": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"])} # Convert Namespace to dict - samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable) - filename.open("w").write(samples_dumped) + data_to_dump = {"args": vars(args), "model_configs": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"]), "time": datetime_str} + samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable, ensure_ascii=False) + filename.open("w", encoding="utf-8").write(samples_dumped) eval_logger.info(f"Saved samples to {filename}") return results, samples diff --git a/lmms_eval/api/__init__.py b/lmms_eval/api/__init__.py old mode 100644 new mode 100755 diff --git a/lmms_eval/api/filter.py b/lmms_eval/api/filter.py old mode 100644 new mode 100755 diff --git a/lmms_eval/api/instance.py b/lmms_eval/api/instance.py old mode 100644 new mode 100755 diff --git a/lmms_eval/api/metrics.py b/lmms_eval/api/metrics.py old mode 100644 new mode 100755 index 67958f51..c0e5c505 --- a/lmms_eval/api/metrics.py +++ b/lmms_eval/api/metrics.py @@ -16,6 +16,11 @@ # Register Aggregations First +@register_aggregation("bypass") +def bypass_agg(arr): + return 999 + + @register_aggregation("mean") def mean(arr): return sum(arr) / len(arr) @@ -226,6 +231,16 @@ def mean_stderr(arr): return sample_stddev(arr) / math.sqrt(len(arr)) +@register_metric( + metric="bypass", + higher_is_better=True, + output_type=["loglikelihood", "multiple_choice", "generate_until"], + aggregation="bypass", +) +def bypass(items): + return items + + @register_metric( metric="mcc", higher_is_better=True, diff --git a/lmms_eval/api/model.py b/lmms_eval/api/model.py old mode 100644 new mode 100755 diff 
--git a/lmms_eval/api/registry.py b/lmms_eval/api/registry.py old mode 100644 new mode 100755 index 0728b86d..253341db --- a/lmms_eval/api/registry.py +++ b/lmms_eval/api/registry.py @@ -1,6 +1,8 @@ from lmms_eval.api.model import lmms +from typing import Callable, Dict import logging +import evaluate as hf_evaluate eval_logger = logging.getLogger("lmms-eval") @@ -104,6 +106,22 @@ def decorate(fn): return decorate +def get_metric(name: str, hf_evaluate_metric=False) -> Callable: + if not hf_evaluate_metric: + if name in METRIC_REGISTRY: + return METRIC_REGISTRY[name] + else: + eval_logger.warning(f"Could not find registered metric '{name}' in lm-eval, searching in HF Evaluate library...") + + try: + metric_object = hf_evaluate.load(name) + return metric_object.compute + except Exception: + eval_logger.error( + f"{name} not found in the evaluate library! Please check https://huggingface.co/evaluate-metric", + ) + + def register_aggregation(name): def decorate(fn): assert name not in AGGREGATION_REGISTRY, f"aggregation named '{name}' conflicts with existing registered aggregation!" diff --git a/lmms_eval/api/samplers.py b/lmms_eval/api/samplers.py old mode 100644 new mode 100755 diff --git a/lmms_eval/api/task.py b/lmms_eval/api/task.py old mode 100644 new mode 100755 index 0a58d981..c035a0a2 --- a/lmms_eval/api/task.py +++ b/lmms_eval/api/task.py @@ -1,45 +1,41 @@ import abc -from dataclasses import dataclass, field, asdict - -import itertools -import os -import re import ast +import itertools +import json import logging +import os import random -from tqdm import tqdm +import re +import shutil +import subprocess +from collections.abc import Callable +from dataclasses import dataclass, field, asdict +from glob import glob +from typing import Any, List, Union import datasets -from datasets import Image, Sequence import numpy as np from PIL import ImageFile +from datasets import DownloadConfig, Image, Sequence +from huggingface_hub import snapshot_download +from tenacity import retry, stop_after_attempt, wait_fixed, stop_after_delay +from tqdm import tqdm -from datasets import DownloadConfig -from typing import Union, List, Any -from collections.abc import Callable -from tenacity import retry, stop_after_attempt, wait_fixed - +from accelerate import Accelerator from lmms_eval import utils from lmms_eval.api import samplers from lmms_eval.api.instance import Instance - -from lmms_eval.filters import build_filter_ensemble from lmms_eval.api.registry import ( - get_aggregation, - get_metric_aggregation, - is_higher_better, + AGGREGATION_REGISTRY, DEFAULT_METRIC_REGISTRY, METRIC_REGISTRY, OUTPUT_TYPE_REGISTRY, - AGGREGATION_REGISTRY, + get_aggregation, + get_metric, + get_metric_aggregation, + is_higher_better, ) - -ALL_OUTPUT_TYPES = [ - "loglikelihood", - "multiple_choice", - "generate_until", -] - +from lmms_eval.filters import build_filter_ensemble eval_logger = logging.getLogger("lmms-eval") @@ -47,6 +43,12 @@ # Include this inside code block to avoid error ImageFile.LOAD_TRUNCATED_IMAGES = True +ALL_OUTPUT_TYPES = [ + "loglikelihood", + "multiple_choice", + "generate_until", +] + @dataclass class TaskConfig(dict): @@ -100,7 +102,7 @@ def __post_init__(self) -> None: import inspect from importlib import import_module - self.dataset_path = inspect.getfile(import_module(self.dataset_path)) + # self.dataset_path = inspect.getfile(import_module(self.dataset_path)) if self.generation_kwargs is not None: if self.output_type != "generate_until": @@ -508,6 +510,29 @@ def dump_config(self) -> dict: # 
(num_fewshot) return self.config.to_dict() + def override_metric(self, metric_name: str) -> None: + """ + Override the default metrics used for evaluation with custom metrics. + + Parameters: + - metric_name (str): The name of the custom metric to override. Should be registered in api.metrics. + """ + ( + self._metric_fn_list, + self._aggregation_list, + self._metric_fn_kwargs, + self._higher_is_better, + ) = ({}, {}, {}, {}) + self._metric_fn_list[metric_name] = get_metric(metric_name) + self._aggregation_list[metric_name] = get_metric_aggregation(metric_name) + self._higher_is_better[metric_name] = is_higher_better(metric_name) + self._metric_fn_kwargs[metric_name] = {} + if not isinstance(self, ConfigurableTask): + self.process_results = lambda x, y: {metric_name: get_metric(metric_name)} + self.aggregation = lambda: {metric_name: get_metric_aggregation(metric_name)} + setattr(self._config, "metric_list", [{"metric": metric_name}]) + setattr(self._config, "process_results", None) + class ConfigurableTask(Task): VERSION = "Yaml" @@ -676,42 +701,127 @@ def _prepare_metric_and_aggregation(self): eval_logger.warning(f"[Task: {self._config.task}] metric {metric_name} is defined, but higher_is_better is not. " f"using default " f"higher_is_better={is_higher_better(metric_name)}") self._higher_is_better[metric_name] = is_higher_better(metric_name) - @retry(stop=stop_after_attempt(5), wait=wait_fixed(2)) + @retry(stop=(stop_after_attempt(5) | stop_after_delay(60)), wait=wait_fixed(2)) def download(self, dataset_kwargs=None) -> None: # If the dataset is a video dataset, # Recursively search whether their is a zip and unzip it to the huggingface home - if dataset_kwargs is not None and "video" in dataset_kwargs and dataset_kwargs["video"]: - hf_home = os.environ["HF_HOME"] - cache_dir = dataset_kwargs["cache_dir"] - dataset_kwargs.pop("cache_dir") - cache_dir = os.path.join(hf_home, cache_dir) - cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset") - zip_files = glob(os.path.join(cache_path, "**/*.zip"), recursive=True) - if not os.path.exists(cache_dir): - for zip_file in zip_files: - shutil.unpack_archive(zip_file, cache_dir) - builder_script = dataset_kwargs["builder_script"] - self.DATASET_PATH = os.path.join(cache_path, builder_script) - dataset_kwargs.pop("video") - dataset_kwargs.pop("builder_script") download_config = DownloadConfig() - download_config.max_retries = dataset_kwargs.get("max_retries", 3) if dataset_kwargs is not None else 3 + download_config.max_retries = dataset_kwargs.get("max_retries", 10) if dataset_kwargs is not None else 10 download_config.num_proc = dataset_kwargs.get("num_proc", 8) if dataset_kwargs is not None else 8 + download_config.local_files_only = dataset_kwargs.get("local_files_only", False) if dataset_kwargs is not None else False + if dataset_kwargs is not None: + if "From_YouTube" in dataset_kwargs: + + def _download_from_youtube(path): + try: + for video in tqdm(self.all_dataset[split]): + video_id = video["videoID"] + target_path = os.path.join(path, f"{video_id}.mp4") + assert shutil.which("yt-dlp") is not None, "yt-dlp must be installed and available in the system's PATH" + command = f"yt-dlp -o {target_path} -f mp4 https://www.youtube.com/watch?v={video_id}" + subprocess.run(command, shell=True) + with open(os.path.join(cache_path, f"{task}_download_status.json"), "w") as f: + f.write(json.dumps({task: "downloaded"})) + except Exception as e: + eval_logger.error(f"Error while downloading {task} data: {e}") + with 
open(os.path.join(cache_path, f"{task}_download_status.json"), "w") as f: + f.write(json.dumps({task: "not downloaded"})) + + hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/") + accelerator = Accelerator() + if accelerator.is_main_process: + dataset_kwargs.pop("From_YouTube") + self.all_dataset = datasets.load_dataset( + path=self.DATASET_PATH, + name=self.DATASET_NAME, + download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS, + **dataset_kwargs if dataset_kwargs is not None else {}, + ) + dataset_kwargs["From_YouTube"] = True + cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset") # download_parquet + split = vars(self.config)["test_split"] + task = vars(self.config)["task"] + + video_path = os.path.join(hf_home, task) + if os.path.exists(os.path.join(cache_path, f"{task}_download_status.json")): + download_status = json.load(open(os.path.join(cache_path, f"{task}_download_status.json"), "r")) + if download_status[task] == "downloaded": + eval_logger.info(f"Data for {task} already download!") + else: + eval_logger.info(f"Start downloading YouTube data to {video_path}...") + _download_from_youtube(video_path) + else: + eval_logger.info(f"Start downloading YouTube data to {video_path}...") + _download_from_youtube(video_path) + + accelerator.wait_for_everyone() + if "builder_script" in dataset_kwargs: + builder_script = dataset_kwargs["builder_script"] + self.DATASET_PATH = os.path.join(cache_path, builder_script) + dataset_kwargs.pop("builder_script") + + downloaded_video_ids = [i.split(".mp4")[0] for i in os.listdir(os.path.expanduser(video_path)) if i.endswith(".mp4")] + # Filtered the existing dataset with the downloaded video ids + self.dataset = datasets.DatasetDict({split: self.all_dataset[split].filter(lambda x: x["videoID"] in downloaded_video_ids)}) + + self.dataset_no_image = self.dataset + dataset_kwargs.pop("From_YouTube") + return + + if "video" in dataset_kwargs and dataset_kwargs["video"]: + hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/") + cache_dir = dataset_kwargs["cache_dir"] + cache_dir = os.path.join(hf_home, cache_dir) + accelerator = Accelerator() + if accelerator.is_main_process: + force_download = dataset_kwargs.get("force_download", False) + force_unzip = dataset_kwargs.get("force_unzip", False) + cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset", force_download=force_download, etag_timeout=60) + zip_files = glob(os.path.join(cache_path, "**/*.zip"), recursive=True) + + def unzip_video_data(zip_file): + import zipfile + + with zipfile.ZipFile(zip_file, "r") as zip_ref: + zip_ref.extractall(cache_dir) + eval_logger.info(f"Extracted all files from {zip_file} to {cache_dir}") + + if force_unzip or (not os.path.exists(cache_dir) and len(zip_files) > 0): + for zip_file in zip_files: + unzip_video_data(zip_file) + + accelerator.wait_for_everyone() + dataset_kwargs.pop("cache_dir") + dataset_kwargs.pop("video") + + if "builder_script" in dataset_kwargs: + builder_script = dataset_kwargs["builder_script"] + self.DATASET_PATH = os.path.join(cache_path, builder_script) + dataset_kwargs.pop("builder_script") + + if "force_download" in dataset_kwargs: + dataset_kwargs.pop("force_download") + + if "force_unzip" in dataset_kwargs: + dataset_kwargs.pop("force_unzip") + + if "local_files_only" in dataset_kwargs: + dataset_kwargs.pop("local_files_only") + self.dataset = datasets.load_dataset( path=self.DATASET_PATH, name=self.DATASET_NAME, download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS, + 
download_config=download_config, + **dataset_kwargs if dataset_kwargs is not None else {}, + ) + self.dataset_no_image = datasets.load_dataset( + path=self.DATASET_PATH, + name=self.DATASET_NAME, + download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS, + download_config=download_config, **dataset_kwargs if dataset_kwargs is not None else {}, ) - if self.config.process_docs is not None: - for split in self.dataset: - if split in [ - self.config.training_split, self.config.validation_split, self.config.test_split, self.config.fewshot_split - ]: - self.dataset[split] = self.config.process_docs(self.dataset[split]) - - # copy dataset, remove image features - self.dataset_no_image = self.dataset.copy() for doc_name in self.dataset_no_image: remove_cols = [] features = self.dataset_no_image[doc_name].features @@ -744,14 +854,20 @@ def has_test_docs(self) -> bool: def training_docs(self) -> datasets.Dataset: if self.has_training_docs(): + if self.config.process_docs is not None: + return self.config.process_docs(self.dataset[self.config.training_split]) return self.dataset[self.config.training_split] def validation_docs(self) -> datasets.Dataset: if self.has_validation_docs(): + if self.config.process_docs is not None: + return self.config.process_docs(self.dataset[self.config.validation_split]) return self.dataset[self.config.validation_split] def test_docs(self) -> datasets.Dataset: if self.has_test_docs(): + if self.config.process_docs is not None: + return self.config.process_docs(self.dataset[self.config.test_split]) return self.dataset[self.config.test_split] def fewshot_docs(self): @@ -985,11 +1101,17 @@ def construct_requests(self, doc_id: int, ctx: str, **kwargs) -> Union[List[Inst arguments = (ctx, self.config.generation_kwargs, self.doc_to_visual, doc_id, self.config.task, split) return Instance(request_type=self.OUTPUT_TYPE, arguments=arguments, idx=0, **kwargs) - def process_results(self, doc, results): + # TODO: we add a full_docs interface here for some evaluations that needs to access the full datasets during process_results function. we may have better ways to handle this. + @retry(stop=(stop_after_attempt(5) | stop_after_delay(1200)), wait=wait_fixed(2)) + def process_results(self, doc, results, full_docs=None): if self.OUTPUT_TYPE == "generate_until": results[0] = results[0].strip() + + kwargs = {} + if full_docs is not None: + kwargs["full_docs"] = full_docs if callable(self.config.process_results): - return self.config.process_results(doc, results) + return self.config.process_results(doc, results, **kwargs) result_dict = {} use_metric = list(self._metric_fn_list.keys()) diff --git a/lmms_eval/evaluator.py b/lmms_eval/evaluator.py old mode 100644 new mode 100755 index a97edff0..8a4c49d8 --- a/lmms_eval/evaluator.py +++ b/lmms_eval/evaluator.py @@ -17,6 +17,8 @@ import lmms_eval.api.metrics import lmms_eval.api.registry +import re + from lmms_eval.utils import ( positional_deprecated, run_task_tests, @@ -44,6 +46,7 @@ def simple_evaluate( log_samples: bool = True, gen_kwargs: str = None, cli_args=None, # Bo: put args into more functions (cost 48 Bytes per call) + predict_only: bool = False, ): """Instantiate and evaluate a model on a list of tasks. @@ -111,6 +114,12 @@ def simple_evaluate( if config["output_type"] == "generate_until" and gen_kwargs: config["generation_kwargs"].update(gen_kwargs) + if predict_only: + log_samples = True + eval_logger.info(f"Processing {task_name} in output-only mode. 
Metrics will not be calculated!") + # we have to change the class properties post-hoc. This is pretty hacky. + task_obj.override_metric(metric_name="bypass") + if num_fewshot is not None: if config["num_fewshot"] == 0: eval_logger.info(f"num_fewshot has been set to 0 for {task_name} in its config. Manual configuration will be ignored.") @@ -285,7 +294,7 @@ def evaluate( cloned_reqs.extend([req] * req.repeats) # run requests through model - resps = getattr(lm, reqtype)(cloned_reqs) + resps = getattr(lm, reqtype)(cloned_reqs) # Choiszt run generate until # put responses from model into a list of length K for each request. for x, req in zip(resps, cloned_reqs): @@ -318,7 +327,7 @@ def evaluate( # hack: remove image columns to speed avoid loading images and speed up postprocessing # reason: doc_iterator will actually load image if it's in the doc. docs = task.test_docs() if task.has_test_docs() else task.validation_docs() - if "d170" not in task_name and "dc100" not in task_name and "dc200" not in task_name: + if "d170" not in task_name and "dc100" not in task_name and "dc200" not in task_name and "llava_wilder" not in task_name and "livebench" not in task_name: remove_cols = [] features = docs.features # If it is an Image instance or a Sequence of Image instance. Remove it @@ -329,6 +338,13 @@ def evaluate( remove_cols.append(feature) if remove_cols: docs = docs.remove_columns(remove_cols) + + ####################### Processing with Full Docs Mode ####################### + if task_name in ["videochatgpt_consistency"]: + full_docs = True + else: + full_docs = False + doc_iterator = itertools.islice(enumerate(docs), lm.rank, limit, lm.world_size) # Instead of converting the iterator to a list, use `itertools.tee` to create a parallel iterator for counting # doc_iterator, doc_iterator_for_counting = itertools.tee(doc_iterator) @@ -340,7 +356,10 @@ def evaluate( # subset instances to only this document id ; sort by idx requests = list(filter(lambda x: x.doc_id == doc_id, task.instances)) requests.sort(key=lambda x: x.idx) - metrics = task.process_results(doc, [req.filtered_resps[key] for req in requests]) + if full_docs: + metrics = task.process_results(doc, [req.filtered_resps[key] for req in requests], full_docs=docs) + else: + metrics = task.process_results(doc, [req.filtered_resps[key] for req in requests]) if log_samples: target = task.doc_to_target(doc) example = { @@ -403,6 +422,8 @@ def evaluate( vals_torch[(task_name, key, metric)] = gathered_item vals = vals_torch + # Ensure all ranks wait for rank 0 to finish aggregation + torch.distributed.barrier() if lm.rank == 0: ### Get task ordering for correct sample-wide aggregation @@ -502,11 +523,22 @@ def evaluate( continue if metric in results[group]: - results[group][metric] = (results[group][metric] * total_size + metric_score * current_size) / (total_size + current_size) - # $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$ - results[group][stderr] = ((total_size - 1) * results[group][stderr] + (current_size - 1) * var_score) / (total_size + current_size - 1) + total_size * current_size / ( - (total_size + current_size) * (total_size + current_size - 1) - ) * (results[group][metric] - metric_score) ** 2 + if isinstance(results[group][metric], str) == False: + results[group][metric] = (results[group][metric] * total_size + metric_score * current_size) / (total_size + current_size) + # $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$ + 
results[group][stderr] = ((total_size - 1) * results[group][stderr] + (current_size - 1) * var_score) / (total_size + current_size - 1) + total_size * current_size / ( + (total_size + current_size) * (total_size + current_size - 1) + ) * (results[group][metric] - metric_score) ** 2 + else: + # accuracy = re.search(r'acc: ([\d.]+)%', results[group][metric]).group(1) + # score = re.search(r'score: ([\d.]+)', results[group][metric]).group(1) + # group_accuracy = float(accuracy) + # group_score = float(score) + # group_accuracy = (group_accuracy * total_size + metric_score * current_size) / total_size + # group_score = (group_score * total_size + metric_score * current_size) / total_size + # results[group][metric] = "Acc: " + str(group_accuracy) + " Score: " + str(group_score) + results[group][metric] = "group_results" + results[group][stderr] = 0 else: results[group][metric] = metric_score results[group][stderr] = var_score diff --git a/lmms_eval/filters/__init__.py b/lmms_eval/filters/__init__.py old mode 100644 new mode 100755 diff --git a/lmms_eval/filters/decontamination.py b/lmms_eval/filters/decontamination.py old mode 100644 new mode 100755 diff --git a/lmms_eval/filters/extraction.py b/lmms_eval/filters/extraction.py old mode 100644 new mode 100755 index 329d7540..f3045673 --- a/lmms_eval/filters/extraction.py +++ b/lmms_eval/filters/extraction.py @@ -212,3 +212,67 @@ def find_match(self, regex, resp, convert_dict={}): if match and match in convert_dict: match = convert_dict[match] return match + + +# Designed for the AI2D/RealworldQA dataset +class SimpleMultiChoiceRegexFilter(ExtendedRegexFilter): + def __init__(self, *args, **kwargs): + """ + regex_pattern: The basic regex pattern to use. If fails to match, we will use the customized match procedure + - step 1 : We parse the choices between ([A-Z])s then try to find these choices in the response. + - step 2 : We parse the choice with regex :[\s]*([A-?]), where ? varies by number of choices. + group_select: Selects the (group_select)th match from the findall result. + ignore_case: Ignores the case during step 1 matching + ignore_punctuation: Remove the punctuation during step 1 matching + regexes_to_ignore: Remove these regexes during step 1 matching + """ + super().__init__(*args, **kwargs) + + def apply(self, resps, docs): + # here, we assume we have a list, in which each element is + # a list of model responses for some particular input/target pair. + # so we process each of these (same input/target response sets) + # independently (and keep them a list.) 
+ + filtered_resps = [] + + for r, doc in zip(resps, docs): + fallback_regexes = [] + choice_to_alpha = {} + next_alpha = "A" + + without_paren_fallback_regexes = [] + without_paren_to_target = {} + + # Regex to extract multiple choice options from the question + multiple_choices_regex = re.compile(r"\b([A-Z])\.\s+([^\n]*)") + matches = multiple_choices_regex.findall(doc["question"]) + + # Build regex patterns and mappings for each choice + for m in matches: + choice_text = m[1].strip() + fallback_regexes.append(f"{re.escape(choice_text)}") + choice_to_alpha[choice_text] = next_alpha + + next_alpha = chr(ord(next_alpha) + 1) + + # Compile regex to match any of the extracted choices + fallback_regex = re.compile("|".join(fallback_regexes)) + + # Process each response + filtered = [] + for resp in r: + # Remove any punctuation and extra spaces + cleaned_resp = re.sub(r"[^\w\s]", "", resp).strip() + # Try to match cleaned response with the choice text + match = fallback_regex.search(cleaned_resp) + if match and match.group() in choice_to_alpha: + # Map the matched choice text back to its corresponding letter + filtered.append(choice_to_alpha[match.group()]) + else: + # If no match, return the cleaned response + filtered.append(cleaned_resp) + + filtered_resps.append(filtered[0]) + + return filtered_resps diff --git a/lmms_eval/filters/selection.py b/lmms_eval/filters/selection.py old mode 100644 new mode 100755 diff --git a/lmms_eval/filters/transformation.py b/lmms_eval/filters/transformation.py old mode 100644 new mode 100755 diff --git a/lmms_eval/logging_utils.py b/lmms_eval/logging_utils.py old mode 100644 new mode 100755 index 21a2ee04..6107d21b --- a/lmms_eval/logging_utils.py +++ b/lmms_eval/logging_utils.py @@ -89,10 +89,10 @@ def finish(self): def init_run(self): if "name" not in self.wandb_args: if "config" in self.all_args_dict and self.all_args_dict["config"] != "": - self.wandb_args["name"] = self.all_args_dict["config"].split("/")[-1].replace(".yaml", "") + "_" + self.args.log_samples_suffix + self.wandb_args["name"] = self.all_args_dict["config"].split("/")[-1].replace(".yaml", "") + "/" + self.args.log_samples_suffix else: task_names = self.args.tasks.replace(",", "/") - self.wandb_args["name"] = f"{self.args.model}_{task_names}_{self.args.log_samples_suffix}" + self.wandb_args["name"] = f"{self.args.model}/<{task_names}>/{self.args.log_samples_suffix}" if self.args.num_fewshot: self.wandb_args["name"] += f"_{self.args.num_fewshot}shot" if "project" not in self.wandb_args: @@ -119,6 +119,7 @@ def _get_config(self) -> Dict[str, Any]: def _sanitize_results_dict(self) -> Tuple[Dict[str, str], Dict[str, Any]]: """Sanitize the results dictionary.""" _results = copy.deepcopy(self.results.get("results", dict())) + _results["model_configs"] = self.results.get("model_configs", dict()) # Remove None from the metric string name tmp_results = copy.deepcopy(_results) @@ -138,15 +139,18 @@ def _sanitize_results_dict(self) -> Tuple[Dict[str, str], Dict[str, Any]]: if isinstance(metric_value, str): wandb_summary[f"{task}/{metric_name}"] = metric_value + wandb_summary["model_configs"] = self.results.get("model_configs", dict()) for summary_metric, summary_value in wandb_summary.items(): - _task, _summary_metric = summary_metric.split("/") - _results[_task].pop(_summary_metric) + if summary_metric != "model_configs": + _task, _summary_metric = summary_metric.split("/") + _results[_task].pop(_summary_metric) tmp_results = copy.deepcopy(_results) for task_name, task_results in 
tmp_results.items(): - for metric_name, metric_value in task_results.items(): - _results[f"{task_name}/{metric_name}"] = metric_value - _results[task_name].pop(metric_name) + if task_name != "model_configs": + for metric_name, metric_value in task_results.items(): + _results[f"{task_name}/{metric_name}"] = metric_value + _results[task_name].pop(metric_name) for task in self.task_names: _results.pop(task) diff --git a/lmms_eval/models/__init__.py b/lmms_eval/models/__init__.py old mode 100644 new mode 100755 index 5dbfc7ae..3fe74164 --- a/lmms_eval/models/__init__.py +++ b/lmms_eval/models/__init__.py @@ -1,16 +1,32 @@ import os +import hf_transfer + +os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1" AVAILABLE_MODELS = { "llava": "Llava", - "llava_hf": "LlavaHf", - "llava_sglang": "LlavaSglang", "qwen_vl": "Qwen_VL", "fuyu": "Fuyu", + "batch_gpt4": "BatchGPT4", "gpt4v": "GPT4V", "instructblip": "InstructBLIP", "minicpm_v": "MiniCPM_V", - "idefics2": "Idefics2", + "llava_vid": "LlavaVid", + "videoChatGPT": "VideoChatGPT", + "llama_vid": "LLaMAVid", + "video_llava": "VideoLLaVA", + "xcomposer2_4KHD": "XComposer2_4KHD", + "claude": "Claude", "qwen_vl_api": "Qwen_VL_API", + "llava_sglang": "LlavaSglang", + "idefics2": "Idefics2", + "internvl": "InternVLChat", + "gemini_api": "GeminiAPI", + "gemini_model": "GeminiModel", + "reka": "Reka", + "llava_onevision": "Llava_OneVision", + "from_log": "FromLog", + "mplug_owl_video": "mplug_Owl", "phi3v": "Phi3v", } @@ -19,8 +35,3 @@ exec(f"from .{model_name} import {model_class}") except ImportError: pass - - -import hf_transfer - -os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1" diff --git a/lmms_eval/models/batch_gpt4.py b/lmms_eval/models/batch_gpt4.py new file mode 100755 index 00000000..54bfa149 --- /dev/null +++ b/lmms_eval/models/batch_gpt4.py @@ -0,0 +1,205 @@ +# Standard library imports +from copy import deepcopy +from io import BytesIO +import base64 +import logging +import os +import time +import json + +# Related third-party imports +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +import numpy as np +from PIL import Image +import requests as url_requests +from tqdm import tqdm +from openai import OpenAI + +# Local application/library specific imports +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval import utils + +# Conditional imports +try: + from decord import VideoReader, cpu +except ImportError: + eval_logger = logging.getLogger("lmms-eval") + eval_logger.info("Decord is not installed. 
Video input will not be supported.") + +# Constants and global configurations +API_TYPE = os.getenv("API_TYPE", "openai") +NUM_SECONDS_TO_SLEEP = 5 + +if API_TYPE == "openai": + API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") + API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") + headers = { + "Authorization": f"Bearer {API_KEY}", + "Content-Type": "application/json", + } +elif API_TYPE == "azure": + API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken") + API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY") + headers = { + "api-key": API_KEY, + "Content-Type": "application/json", + } +else: + API_URL = "YOUR_API_URL" + API_KEY = "YOUR_API_KEY" + + +@register_model("batch_gpt4") +class BatchGPT4(lmms): + def __init__( + self, + model_version: str = "gpt-4o", + api_key: str = API_KEY, + api_url: str = API_URL, + modality: str = "image", + max_frames_for_video: int = 10, + timeout: int = 120, + **kwargs, + ) -> None: + super().__init__() + # Manually set a image token for GPT4V so that we can search for it + # and split the text and image + # Here we just use the same token as llava for convenient + self.model_version = model_version + self.modality = modality + self.max_frames_for_video = max_frames_for_video + self.image_token = "" + self.timeout = timeout + + self.api_key = api_key + self.api_url = api_url + self.client = OpenAI(api_key=api_key) + + accelerator = Accelerator() + assert accelerator.state.local_process_index == 0, "BatchGPT4 does not support distributed inference." + assert accelerator.state.num_processes == 1, "BatchGPT4 does not support distributed inference." + + # Function to encode the image + def encode_image(self, image: Image): + output_buffer = BytesIO() + image.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + return base64_str + + # Function to encode the video + def encode_video(self, video_path, for_get_frames_num): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, for_get_frames_num, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + frames = vr.get_batch(frame_idx).asnumpy() + + base64_frames = [] + for frame in frames: + img = Image.fromarray(frame) + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + base64_frames.append(base64_str) + + return base64_frames + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def generate_until(self, requests): + # Prepare the batch requests data + requests_data = {} + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Batch Preparing") + for idx, (contexts, gen_kwargs, doc_to_visual, doc_id, task, split) in enumerate([reg.args for reg in requests]): + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + imgs = [] + for visual in visuals: + if self.modality == "image": + img = self.encode_image(visual) + imgs.append(img) + elif self.modality == "video": + frames = self.encode_video(visual, self.max_frames_for_video) + imgs.extend(frames) + + messages = [] + if self.image_token not in contexts: + messages.append({"role": "user", "content": contexts}) + for img in imgs: + messages.append({"role": "user", "content": 
f"data:image/jpeg;base64,{img}"}) + else: + contexts_split = contexts.split(self.image_token) + for idx, context in enumerate(contexts_split): + if idx < len(imgs): + messages.append({"role": "user", "content": context}) + messages.append({"role": "user", "content": f"data:image/jpeg;base64,{imgs[idx]}"}) + if len(contexts_split) > len(imgs): + messages.append({"role": "user", "content": contexts_split[-1]}) + + requests_data[f"request-{idx}"] = {"model": self.model_version, "messages": messages, "max_tokens": gen_kwargs.get("max_new_tokens", 1024)} + pbar.update(1) + + file_path = os.getenv("HF_HOME", "~/.cache/huggingface") + f"/batchinput_{len(requests_data)}.jsonl" + file_path = self.create_batch_input_file(requests_data, file_path) + file_id = self.upload_input_file(file_path) + + batch_response = self.create_batch(file_id, metadata={"description": "Batch Processing for GPT-4"}) + batch_status = self.check_batch_status(batch_response.id) + while True: + batch_status = self.check_batch_status(batch_response.id) + if batch_status.status == "completed": + eval_logger.info("Batch processing completed.") + batch_results = self.retrieve_batch_results(batch_status.output_file_id) + res = [result["response"]["choices"][0]["message"]["content"] for result in json.loads(batch_results)] + return res + elif batch_status.status == "failed": + eval_logger.info("Batch processing failed.") + res = ["Batch failed"] * len(requests) + return res + else: + eval_logger.info(f"Batch status: {batch_status.status}. Retrying in {NUM_SECONDS_TO_SLEEP} seconds.") + time.sleep(NUM_SECONDS_TO_SLEEP) + + def loglikelihood(self, requests): + # TODO + assert False, "GPT4V not support" + + def create_batch_input_file(self, requests_data, file_path="batchinput.jsonl"): + with open(file_path, "w") as file: + for request_id, data in requests_data.items(): + json_record = json.dumps({"custom_id": request_id, "method": "POST", "url": "/v1/chat/completions", "body": data}) + file.write(json_record + "\n") + return file_path + + def upload_input_file(self, file_path): + with open(file_path, "rb") as file: + response = self.client.files.create(file=file, purpose="batch") + return response.id + + def create_batch(self, file_id, metadata=None): + if metadata is None: + metadata = {} + response = self.client.batches.create(input_file_id=file_id, endpoint="/v1/chat/completions", completion_window="24h", metadata=metadata) + return response + + def check_batch_status(self, batch_id): + return self.client.batches.retrieve(batch_id) + + def retrieve_batch_results(self, file_id): + return self.client.files.content(file_id) + + def cancel_batch(self, batch_id): + return self.client.batches.cancel(batch_id) + + def list_batches(self, limit=10): + return self.client.batches.list(limit=limit) diff --git a/lmms_eval/models/claude.py b/lmms_eval/models/claude.py new file mode 100644 index 00000000..c629ca06 --- /dev/null +++ b/lmms_eval/models/claude.py @@ -0,0 +1,256 @@ +from io import BytesIO +from copy import deepcopy +import os +import base64 +import json +from typing import List, Tuple, Union +from tqdm import tqdm +import requests as url_requests +import time +import logging + +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval import utils + +from accelerate import Accelerator, DistributedType + +from PIL import Image + +NUM_SECONDS_TO_SLEEP = 5 +eval_logger = logging.getLogger("lmms-eval") + +try: + import anthropic + from decord 
import VideoReader, cpu + import numpy as np +except Exception as e: + eval_logger.error(f"Error importing claude: {e}") + +API_URL = os.getenv("ANTHROPIC_API_URL", "https://api.anthropic.com/v1/complete") +API_KEY = os.getenv("ANTHROPIC_API_KEY", "YOUR_API_KEY") + + +@register_model("claude") +class Claude(lmms): + def __init__( + self, + model_version: str = "claude-3-opus-20240229", + image_token: str = "", # Use to separate interleaved image and text + system_prompt: str = "", # Whether you want some special system prompt here + modality: str = "image", + continual_mode: bool = False, + response_persistent_folder: str = None, + **kwargs, + ) -> None: + super().__init__() + self.model_version = model_version + self.image_token = image_token + self.system_prompt = system_prompt + self.modality = modality + + self.continual_mode = continual_mode + if self.continual_mode and response_persistent_folder is None: + raise ValueError("Continual mode requires a persistent path for the response. Please provide a valid path.") + self.response_persistent_folder = response_persistent_folder + self.response_persistent_file = os.path.join(self.response_persistent_folder, f"{self.model_version}_response.json") + + if os.path.exists(self.response_persistent_file): + with open(self.response_persistent_file, "r") as f: + self.response_cache = json.load(f) + self.cache_mode = "resume" + else: + self.response_cache = {} + self.cache_mode = "start" + + accelerator = Accelerator() + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device + + def encode_image(self, image): + output_buffer = BytesIO() + image.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + return base64_str + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def get_image_size(self, image): + # Create a BytesIO object to store the image bytes + img_byte_array = BytesIO() + + # Save the image to the BytesIO object + image.save(img_byte_array, format="PNG") + + # Get the size of the BytesIO object + img_size = img_byte_array.tell() + + return img_size + + # The max file size is 5MB for claude + def shrink_image_to_file_size(self, img: Image, max_file_size=4838990) -> Image: + # Get the current size of the image + original_size = self.get_image_size(img) + + # If the image size is already smaller than the desired size, return + if original_size <= max_file_size: + return img + + # Calculate the ratio to shrink the image + # Somehow I found out sqrt ratio is not enough to shrink the image + # below threshold, so I guess we do more + shrink_ratio = min(0.9, max_file_size / original_size) + + # Resize the image with the calculated ratio + new_width = int(img.width * shrink_ratio) + new_height = int(img.height * shrink_ratio) + img = img.resize((new_width, new_height), Image.LANCZOS) 
+ + return self.shrink_image_to_file_size(img, max_file_size) + + def encode_video(self, video_path): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, self.max_frames_for_video, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + frames = vr.get_batch(frame_idx).asnumpy() + + base64_frames = [] + for frame in frames: + img = Image.fromarray(frame) + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + base64_frames.append(base64_str) + + return base64_frames + + def generate_until(self, requests) -> List[str]: + client = anthropic.Anthropic() + + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + empty_image_block = { + "type": "image", + "source": { + "type": "base64", + "media_type": "image/png", + }, + } + empty_text_block = {"type": "text"} + empty_messages = [ + { + "role": "user", + "content": [], + } + ] + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + ###################### CONTINUAL MODE ###################### + if self.continual_mode is True and self.cache_mode == "resume": + doc_uuid = f"{task}___{split}___{doc_id}" + if doc_uuid in self.response_cache: + response_text = self.response_cache[doc_uuid] + if response_text: + res.append(response_text) + pbar.update(1) + continue + + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + imgs = [] + for visual in visuals: + if isinstance(visual, str) and os.path.exists(visual): # Assuming visual is a path to a video + visual = self.encode_video(visual) + for img in visual: + imgs.append(img) + else: + visual = self.shrink_image_to_file_size(visual) + img = self.encode_image(visual) + imgs.append(img) + + messages = deepcopy(empty_messages) + + if self.image_token not in contexts: + for img in imgs: + image_block = deepcopy(empty_image_block) + image_block["source"]["data"] = img + messages[0]["content"].append(image_block) + text_block = deepcopy(empty_text_block) + text_block["text"] = contexts + messages[0]["content"].append(text_block) + else: + contexts = contexts.split(self.image_token) + for idx, img in enumerate(imgs): + text_block = deepcopy(empty_text_block) + image_block = deepcopy(empty_image_block) + text_block["text"] = contexts[idx] + messages[0]["content"].append(text_block) + image_block["source"]["data"] = img + messages[0]["content"].append(image_block) + + # If n image tokens are in the contexts + # contexts will be split into n+1 chunks + # Manually add the last chunk into the messages + text_block = deepcopy(empty_text_block) + text_block["text"] = contexts[-1] + messages[0]["content"].append(text_block) + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + if "top_p" not in gen_kwargs: + gen_kwargs["top_p"] = None + if "num_beams" not in gen_kwargs: + gen_kwargs["num_beams"] = 1 + + for attempt in range(5): + try: + message = client.messages.create(model=self.model_version, max_tokens=gen_kwargs["max_new_tokens"], system=self.system_prompt, temperature=gen_kwargs["temperature"], top_p=gen_kwargs["top_p"], messages=messages) + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {str(e)}") + if attempt < 5 - 1: # If we have retries
left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log and return empty + eval_logger.error(f"All 5 attempts failed. Last error message: {str(e)}") + res.append("") + pbar.update(1) + continue + + res.append(message.content[0].text) + pbar.update(1) + + ###################### CONTINUAL MODE ###################### + if self.continual_mode is True: # Cache the response + doc_uuid = f"{task}___{split}___{doc_id}" + self.response_cache[doc_uuid] = response_text + with open(self.response_persistent_file, "w") as f: + json.dump(self.response_cache, f) + + pbar.close() + + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + assert False, "Not supported for claude" diff --git a/lmms_eval/models/from_log.py b/lmms_eval/models/from_log.py new file mode 100644 index 00000000..4c573e0f --- /dev/null +++ b/lmms_eval/models/from_log.py @@ -0,0 +1,117 @@ +import logging +import json +import os +import re + +from datetime import datetime +from typing import List, Tuple +from tqdm import tqdm +from lmms_eval.api.registry import register_model +from lmms_eval.api.model import lmms +from lmms_eval.api.instance import Instance +from accelerate import Accelerator, DistributedType + +eval_logger = logging.getLogger("lmms-eval") + + +@register_model("from_log") +class FromLog(lmms): + def __init__( + self, + logs: str = "logs", + model_name: str = None, + model_args: str = None, + have_limits: bool = False, + **kwargs, + ) -> None: + super().__init__() + + self.logs = {} + + log_folders = logs.split(",") + + def matched_model(_model_args): + if model_name and model_name != _model_args["model"]: + return False + + if model_args: + _model_args_list = model_args.split(",") + + for _model_arg in _model_args_list: + if _model_arg not in _model_args["model_args"]: + return False + + if not have_limits and _model_args["limit"] is not None: + return False + + return True + + for log_folder in log_folders: + for root, dirs, files in os.walk(log_folder): + for file in files: + if file.endswith(".json"): + try: + log_file = os.path.join(root, file) + + with open(log_file, "r") as f: + log_data = json.load(f) + + # check if model is matched + _model_args = log_data["args"] + if not matched_model(_model_args): + raise Exception("Model not matched") + + # load logs + logs = {} + for data in log_data["logs"]: + id = data["doc_id"] + response = data["resps"][0] + logs[id] = response + + task = log_data["model_configs"]["task"] + + pattern = re.compile(r"\d{4}_\d{4}") + + if "time" in log_data: + log_time = log_data["time"] + elif pattern.search(os.path.abspath(log_file)): + log_time = pattern.findall(os.path.abspath(log_file))[-1] + else: + log_time = "unknown" + + if task not in self.logs or (self.logs[task]["time"] == "unknown" or datetime.strptime(log_time, "%m%d_%H%M") > datetime.strptime(self.logs[task]["time"], "%m%d_%H%M")): + self.logs[task] = {"time": log_time, "logs": logs} + + except Exception as e: + pass + + accelerator = Accelerator() + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." 
+ self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + response = self.logs[task]["logs"][doc_id] + res.append(response[0]) + pbar.update(1) + + pbar.close() + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + # TODO + assert False, "not support" diff --git a/lmms_eval/models/fuyu.py b/lmms_eval/models/fuyu.py old mode 100644 new mode 100755 index 7a4844dc..5960d69e --- a/lmms_eval/models/fuyu.py +++ b/lmms_eval/models/fuyu.py @@ -21,8 +21,6 @@ eval_logger = logging.getLogger("lmms-eval") -eval_logger = logging.getLogger("lmms-eval") - @register_model("fuyu") class Fuyu(lmms): @@ -85,7 +83,7 @@ def __init__( self._rank = 0 self._word_size = 1 - '''if accelerator.num_processes > 1: + """if accelerator.num_processes > 1: assert accelerator.distributed_type in [ DistributedType.FSDP, DistributedType.MULTI_GPU, @@ -98,7 +96,7 @@ def __init__( if self.accelerator.is_local_main_process: eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") self._rank = self.accelerator.local_process_index - self._world_size = self.accelerator.num_processes''' + self._world_size = self.accelerator.num_processes""" @property def config(self): @@ -204,7 +202,7 @@ def _collate(x): # generation_output = self.model.generate( # **model_inputs, temperature=gen_kwargs["temperature"], max_new_tokens=gen_kwargs["max_new_tokens"], top_p=gen_kwargs["top_p"], num_beams=gen_kwargs["num_beams"], pad_token_id=self.tokenizer.eos_token_id # ) - generation_output = self.model.generate(**model_inputs, max_new_tokens=gen_kwargs["max_new_tokens"]) + generation_output = self.model.generate(**model_inputs, max_new_tokens=gen_kwargs["max_new_tokens"], pad_token_id=self.tokenizer.eos_token_id) generation_texts = self.processor.batch_decode(generation_output, skip_special_tokens=True) response = [gen_text.split("\x04")[1].strip(" ").strip("\n") for gen_text in generation_texts] res.extend(response) diff --git a/lmms_eval/models/gemini_api.py b/lmms_eval/models/gemini_api.py new file mode 100644 index 00000000..0b2be05e --- /dev/null +++ b/lmms_eval/models/gemini_api.py @@ -0,0 +1,185 @@ +import io +import os +import time +import logging +import json + +from PIL import Image +from typing import List, Tuple +from tqdm import tqdm +from lmms_eval.api.registry import register_model +from lmms_eval.api.model import lmms +from lmms_eval.api.instance import Instance +from accelerate import Accelerator, DistributedType + +eval_logger = logging.getLogger("lmms-eval") + +try: + import google.generativeai as genai + from google.generativeai.types import HarmCategory, HarmBlockThreshold + + NUM_SECONDS_TO_SLEEP = 30 + GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY") + genai.configure(api_key=GOOGLE_API_KEY) + +except Exception as e: + eval_logger.error(f"Error importing generativeai: {str(e)}") + genai = None + + 
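# Editorial sketch (illustrative, not part of the patch): the on-disk response cache
# ("continual mode") shared by the Claude and Gemini wrappers in this diff. Cache keys are
# "{task}___{split}___{doc_id}" and the cache file is
# "<response_persistent_folder>/<model_version>_response.json"; the helper name below is
# an assumption made for illustration only.
import json
import os


def load_response_cache(response_persistent_folder: str, model_version: str) -> tuple:
    """Return (cache, mode): resume from a previous run's JSON file if it exists, else start fresh."""
    path = os.path.join(response_persistent_folder, f"{model_version}_response.json")
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f), "resume"  # mirrors self.cache_mode = "resume"
    return {}, "start"                     # mirrors self.cache_mode = "start"


# Example usage: cache, mode = load_response_cache("./persist", "gemini-1.5-flash-latest")
# A hit is looked up as cache.get(f"{task}___{split}___{doc_id}") before calling the API.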
+@register_model("gemini_api") +class GeminiAPI(lmms): + def __init__( + self, + model_version: str = "gemini-1.5-flash-latest", + modality: str = "image", + timeout: int = 120, + continual_mode: bool = False, + response_persistent_folder: str = None, # We will cache the Gemini API response in this path and use it for future requests + **kwargs, + ) -> None: + super().__init__() + self.model_version = model_version + self.timeout = timeout + self.model = genai.GenerativeModel(model_version) + self.continual_mode = continual_mode + if self.continual_mode and response_persistent_folder is None: + raise ValueError("Continual mode requires a persistent path for the response. We will cache the Gemini API response in this path and use it for future requests. Please provide a valid path.") + self.response_persistent_folder = response_persistent_folder + self.response_persistent_file = os.path.join(self.response_persistent_folder, f"{self.model_version}_response.json") + + if os.path.exists(self.response_persistent_file): + with open(self.response_persistent_file, "r") as f: + self.response_cache = json.load(f) + self.cache_mode = "resume" + else: + self.response_cache = {} + self.cache_mode = "start" + + accelerator = Accelerator() + if accelerator.num_processes > 1: + assert self.continual_mode is False, "Continual mode is not supported with distributed inference." + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device + + self.modality = modality + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def get_image_size(self, image): + # Create a BytesIO object to store the image bytes + img_byte_array = io.BytesIO() + + # Save the image to the BytesIO object + image.save(img_byte_array, format="PNG") + + # Get the size of the BytesIO object + img_size = img_byte_array.tell() + + return img_size + + def encode_video(self, video_path): + uploaded_obj = genai.upload_file(path=video_path) + time.sleep(5) + return uploaded_obj + + def convert_video(self, images): + for idx, img in enumerate(images): + if self.modality == "video" and isinstance(img, str): + try: + images[idx] = self.encode_video(img) + except Exception as e: + eval_logger.error(f"Error converting video: {str(e)}") + return images + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + def get_uuid(task, split, doc_id): + return f"{task}___{split}___{doc_id}" + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + if self.continual_mode is True and self.cache_mode == "resume": + doc_uuid = get_uuid(task, split, doc_id) + if doc_uuid in self.response_cache: + content = self.response_cache[doc_uuid] + if content: + res.append(content) + pbar.update(1) + continue + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if 
"temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + + config = genai.GenerationConfig( + max_output_tokens=gen_kwargs["max_new_tokens"], + temperature=gen_kwargs["temperature"], + ) + + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + visuals = self.convert_video(visuals) + + message = [contexts] + visuals + + for attempt in range(5): + try: + content = self.model.generate_content( + message, + generation_config=config, + safety_settings={ + HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE, + HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE, + HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE, + HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE, + }, + ) + content = content.text + break + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {str(e)}") + if isinstance(e, ValueError): + try: + eval_logger.info(f"Prompt feed_back: {content.prompt_feedback}") + content = "" + break + except Exception: + pass + if attempt < 5 - 1: # If we have retries left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log and return empty + eval_logger.error(f"All 5 attempts failed. Last error message: {str(e)}") + content = "" + res.append(content) + pbar.update(1) + + if self.continual_mode is True: # Cache the response + doc_uuid = get_uuid(task, split, doc_id) + self.response_cache[doc_uuid] = content + with open(self.response_persistent_file, "w") as f: + json.dump(self.response_cache, f) + + pbar.close() + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + # TODO + assert False, "Gemini API not support" diff --git a/lmms_eval/models/gpt4v.py b/lmms_eval/models/gpt4v.py old mode 100644 new mode 100755 index 0194f8ed..d28c1b00 --- a/lmms_eval/models/gpt4v.py +++ b/lmms_eval/models/gpt4v.py @@ -1,5 +1,6 @@ from io import BytesIO from copy import deepcopy +import numpy as np import os import base64 from typing import List, Tuple @@ -13,10 +14,18 @@ from lmms_eval.api.registry import register_model from lmms_eval import utils +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState + +try: + from decord import VideoReader, cpu +except ImportError: + pass + from PIL import Image API_TYPE = os.getenv("API_TYPE", "openai") -NUM_SECONDS_TO_SLEEP = 5 +NUM_SECONDS_TO_SLEEP = 30 eval_logger = logging.getLogger("lmms-eval") if API_TYPE == "openai": @@ -40,6 +49,9 @@ class GPT4V(lmms): def __init__( self, model_version: str = "gpt-4-vision-preview", + modality: str = "video", + max_frames_for_video: int = 10, + timeout: int = 120, **kwargs, ) -> None: super().__init__() @@ -47,7 +59,26 @@ def __init__( # and split the text and image # Here we just use the same token as llava for convenient self.model_version = model_version + self.modality = modality + self.max_frames_for_video = max_frames_for_video self.image_token = "" + self.timeout = timeout + + accelerator = Accelerator() + # assert self.batch_size_per_gpu == 1, "Llava currently does not support batched generation. See https://github.com/haotian-liu/LLaVA/issues/754. HF Llava also has this issue." + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. 
Only DDP and FSDP are supported." + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device # Function to encode the image def encode_image(self, image: Image): @@ -57,6 +88,25 @@ def encode_image(self, image: Image): base64_str = base64.b64encode(byte_data).decode("utf-8") return base64_str + # Function to encode the video + def encode_video(self, video_path, for_get_frames_num): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, for_get_frames_num, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + frames = vr.get_batch(frame_idx).asnumpy() + + base64_frames = [] + for frame in frames: + img = Image.fromarray(frame) + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + base64_frames.append(base64_str) + + return base64_frames + def flatten(self, input): new_list = [] for i in input: @@ -70,12 +120,17 @@ def generate_until(self, requests) -> List[str]: for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: # encode, pad, and truncate contexts for this batch visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] visuals = self.flatten(visuals) - imgs = [] + imgs = [] # multiple images or frames for video for visual in visuals: - img = self.encode_image(visual) - imgs.append(img) + if self.modality == "image": + img = self.encode_image(visual) + imgs.append(img) + elif self.modality == "video": + frames = self.encode_video(visual, self.max_frames_for_video) + imgs.extend(frames) payload = {"model": self.model_version, "messages": []} response_json = {"role": "user", "content": []} @@ -107,12 +162,12 @@ def generate_until(self, requests) -> List[str]: if "num_beams" not in gen_kwargs: gen_kwargs["num_beams"] = 1 - # payload["max_tokens"] = gen_kwargs["max_new_tokens"] - # payload["temperature"] = gen_kwargs["temperature"] + payload["max_tokens"] = gen_kwargs["max_new_tokens"] + payload["temperature"] = gen_kwargs["temperature"] for attempt in range(5): try: - response = url_requests.post(API_URL, headers=headers, json=payload, timeout=20) + response = url_requests.post(API_URL, headers=headers, json=payload, timeout=self.timeout) response_data = response.json() content = response_data["choices"][0]["message"]["content"].strip() @@ -124,9 +179,11 @@ def generate_until(self, requests) -> List[str]: time.sleep(NUM_SECONDS_TO_SLEEP) else: # If this was the last attempt, log and return empty eval_logger.error(f"All 5 attempts failed.
Last error message: {str(e)}") + eval_logger.error(f"Response: {response}") content = "" res.append(content) pbar.update(1) + pbar.close() return res def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: diff --git a/lmms_eval/models/idefics2.py b/lmms_eval/models/idefics2.py index 7419472b..9e274c8a 100644 --- a/lmms_eval/models/idefics2.py +++ b/lmms_eval/models/idefics2.py @@ -17,12 +17,14 @@ eval_logger = logging.getLogger("lmms-eval") DEFAULT_IMAGE_TOKEN = "" -try: +try: import flash_attn + best_fit_attn_implementation = "flash_attention_2" except ImportError: best_fit_attn_implementation = "eager" + @register_model("idefics2") class Idefics2(lmms): """ @@ -50,7 +52,7 @@ def __init__( attn_implementation: Optional[str] = best_fit_attn_implementation, device_map: str = "", use_cache: bool = True, - do_image_splitting: bool =False, + do_image_splitting: bool = False, **kwargs, ) -> None: super().__init__() @@ -194,9 +196,14 @@ def _collate(x): # we assume all gen kwargs in the batch are the same # this is safe to assume because the `grouper` object ensures it. gen_kwargs = all_gen_kwargs[0] - # + # until = gen_kwargs.pop("until", None) - image_aspect_ratio = gen_kwargs.pop("image_aspect_ratio", None) + image_aspect_ratio = gen_kwargs.pop("image_aspect_ratio", None) + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + prompts = [] for context, visual in zip(contexts, visuals): content = [] @@ -212,9 +219,9 @@ def _collate(x): output_ids = self.model.generate(**inputs, **gen_kwargs) # only retain the generated text for output_id, input_id in zip(output_ids, inputs["input_ids"]): - generated_id = output_id[len(input_id):] + generated_id = output_id[len(input_id) :] generated_text = self.tokenizer.decode(generated_id, skip_special_tokens=True) - + res.append(generated_text) pbar.update(1) # reorder this group of results back to original unsorted form diff --git a/lmms_eval/models/instructblip.py b/lmms_eval/models/instructblip.py old mode 100644 new mode 100755 index 2f065ffe..3ca068ed --- a/lmms_eval/models/instructblip.py +++ b/lmms_eval/models/instructblip.py @@ -10,6 +10,7 @@ from accelerate import Accelerator, DistributedType from accelerate.state import AcceleratorState from typing import List, Optional, Union, Tuple +import transformers from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration from lmms_eval.utils import stop_sequences_criteria @@ -20,6 +21,7 @@ warnings.filterwarnings("ignore") eval_logger = logging.getLogger("lmms-eval") +transformers.logging.set_verbosity_error() @register_model("instructblip") diff --git a/lmms_eval/models/internvl.py b/lmms_eval/models/internvl.py new file mode 100644 index 00000000..d808081a --- /dev/null +++ b/lmms_eval/models/internvl.py @@ -0,0 +1,485 @@ +import logging +import os +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from typing import List, Optional, Union, Tuple +import torch +from tqdm import tqdm +import numpy as np +import math +from datetime import timedelta +from transformers import AutoConfig +from huggingface_hub import snapshot_download +import requests + +from lmms_eval import utils +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval.utils import stop_sequences_criteria +from PIL import Image + 
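# Editorial sketch (illustrative, not part of the patch): the uniform frame sampling used by
# the encode_video helpers added above (batch_gpt4 / gpt4v / claude). np.linspace spreads the
# requested number of frame indices evenly across the clip; the numbers below are made up.
import numpy as np

total_frame_num = 300    # hypothetical clip length reported by decord's VideoReader
for_get_frames_num = 10  # e.g. max_frames_for_video
frame_idx = np.linspace(0, total_frame_num - 1, for_get_frames_num, dtype=int).tolist()
assert frame_idx == [0, 33, 66, 99, 132, 166, 199, 232, 265, 299]
# Each selected frame is then converted to a PIL image and base64-encoded from PNG bytes.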
+import subprocess +from pathlib import Path + +wd = Path(__file__).parent.parent.parent.resolve() +import sys + +sys.path.append(os.path.join(str(wd), "InternVL", "internvl_chat")) +eval_logger = logging.getLogger("lmms-eval") + +if not hasattr(eval_logger, "internvl_warning_logged"): + eval_logger.internvl_warning_logged = False + +try: + from internvl.model.internlm2.modeling_internlm2 import InternLM2ForCausalLM + from internvl.model.internvl_chat.configuration_internvl_chat import InternVLChatConfig + from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel + from internvl.model.internvl_chat import InternVLChatModel + from internvl.train.dataset import build_transform, dynamic_preprocess +except ImportError: + eval_logger.debug("InternVL is not installed. Please install InternVL to use this model.") + if not eval_logger.internvl_warning_logged: + eval_logger.debug("InternVL is not installed. Please install InternVL to use this model.") + eval_logger.internvl_warning_logged = True + +import warnings +from typing import Any, List, Optional, Tuple, Union + +import torch.utils.checkpoint + +from peft import LoraConfig, get_peft_model +from torch import nn +from torch.nn import CrossEntropyLoss +from transformers import AutoModel, GenerationConfig, LlamaForCausalLM, LlamaTokenizer +from transformers.modeling_outputs import CausalLMOutputWithPast +from transformers.modeling_utils import PreTrainedModel +from transformers import AutoTokenizer +import re +from huggingface_hub import snapshot_download + + +@register_model("internvl") +class InternVLChat(lmms): + # config_class = InternVLChatConfig + main_input_name = "pixel_values" + _no_split_modules = ["InternVisionEncoderLayer", "LlamaDecoderLayer"] + + """ + 0. Install lmms-eval + cd lmms-eval + pip install -e . + + How to Install InternVL: + 1. Clone the InternVL repository: + git clone https://github.com/OpenGVLab/InternVL.git + + 2. Install the requirements: + pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118 + + 3. Install flash-attn==2.3.6: + pip install flash-attn==2.3.6 --no-build-isolation + """ + + """ + How to download the pretrained model: + 1. Download the pretrained model from hugginface: + cd pretrained/ + # pip install -U huggingface_hub + huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL-Chat-V1-5 --local-dir InternVL-Chat-V1-5 + + 2. 
the pretrained model should be in the following directory: + pretrained + └── InternVL-Chat-V1-5 + """ + + # + # The above steps can be optional, I add snapshot download, so now can just use hf repo_id + # model_args pretrained=OpenGVLab/InternVL-Chat-V1-5 + # + + """ + InternVL-Chat-V1-5 Model for OpenGVLab https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py + Example usage: + + accelerate launch --num_processes=8 --main_process_port 12345 -m lmms_eval \ + --model internvl \ + --model_args pretrained=OpenGVLab/InternVL-Chat-V1-5 \ + --tasks llava_wilder_small \ + --batch_size 1 \ + --output_path ./logs/ \ + --log_samples + """ + + def __init__( + self, + config=None, + pretrained: str = "OpenGVLab/InternVL-Chat-V1-5", + truncation: Optional[bool] = True, + device: Optional[str] = "cuda:0", + dtype: Optional[Union[str, torch.dtype]] = "auto", + batch_size: Optional[Union[int, str]] = 1, + trust_remote_code: Optional[bool] = False, + revision=None, + device_map="cuda:0", + conv_template="vicuna_v1", + use_cache=True, + truncate_context=False, # whether to truncate the context in generation, set it False for LLaVA-1.6 + customized_config=None, # ends in json + dynamic=True, + load_in_8bit=False, + vision_model=None, + language_model=None, + max_num=12, + **kwargs, + ) -> None: + super().__init__() + + assert kwargs == {}, f"Unexpected kwargs: {kwargs}" + + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + + self.dynamic = dynamic # dynamic image_size + self.max_num = max_num + if accelerator.is_main_process: + cache_dir = snapshot_download(repo_id=pretrained, cache_dir="cache_dir", local_dir="cache_dir", local_dir_use_symlinks=False) + accelerator.wait_for_everyone() + # So what I did is that I let main process to download the repo, and then + # other process can just simply read from this repo + cache_dir = snapshot_download(repo_id=pretrained, cache_dir="cache_dir", local_dir="cache_dir", local_dir_use_symlinks=False) + config = InternVLChatConfig.from_pretrained(cache_dir) + tokenizer = AutoTokenizer.from_pretrained(cache_dir, trust_remote_code=True, use_fast=False) + model = InternVLChatModel.from_pretrained(cache_dir, low_cpu_mem_usage=True, config=config, torch_dtype=torch.bfloat16, load_in_8bit=load_in_8bit).eval() + if not load_in_8bit: + model = model.cuda() + # self.model=model + # self.device=self._device + self._tokenizer = tokenizer + # self.tokenizer=tokenizer + self._model = model + self._config = self._model.config + self.use_thumbnail = self.model.config.use_thumbnail + self.model.eval() + self.model.tie_weights() + self.truncation = truncation + self.batch_size_per_gpu = int(batch_size) + self.conv_template = conv_template + self.use_cache = use_cache + self.truncate_context = truncate_context + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type 
provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. + if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. Make sure you run `accelerate config` and set zero stage to 0") + + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + elif accelerator.num_processes == 1 and device_map == "auto": + eval_logger.info(f"Using {accelerator.num_processes} devices with tensor parallelism") + self._rank = 0 + self._word_size = 1 + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + # from internvl model + + self.image_size = config.force_image_size or config.vision_config.image_size + + def wrap_backbone_lora(self, r=128, lora_alpha=256, lora_dropout=0.05): + lora_config = LoraConfig( + r=r, + target_modules=["attn.qkv", "attn.proj", "mlp.fc1", "mlp.fc2"], + lora_alpha=lora_alpha, + lora_dropout=lora_dropout, + ) + self.vision_model = get_peft_model(self.vision_model, lora_config) + self.vision_model.print_trainable_parameters() + + def wrap_llm_lora(self, r=128, lora_alpha=256, lora_dropout=0.05): + lora_config = LoraConfig( + r=r, target_modules=["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj", "mlp.gate_proj", "mlp.down_proj", "mlp.up_proj"], lora_alpha=lora_alpha, lora_dropout=lora_dropout, task_type="CAUSAL_LM" + ) + self.language_model = get_peft_model(self.language_model, lora_config) + self.language_model.enable_input_require_grads() + self.language_model.print_trainable_parameters() + + def pixel_shuffle(self, x, scale_factor=0.5): + n, w, h, c = x.size() + # N, W, H, C --> N, W, H * scale, C // scale + x = x.view(n, w, int(h * scale_factor), int(c / scale_factor)) + # N, W, H * scale, C // scale --> N, H * scale, W, C // scale + x = x.permute(0, 2, 1, 3).contiguous() + # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2) + x = x.view(n, int(h * scale_factor), int(w * scale_factor), int(c / (scale_factor * scale_factor))) + if self.ps_version == "v1": + warnings.warn("In ps_version 'v1', the height and width have not been swapped back, " "which results in a transposed image.") + else: + x = x.permute(0, 2, 1, 3).contiguous() + return x + + def noised_embed(self, vit_embeds, noise_alpha=5): + dims = torch.tensor(vit_embeds.size(1) * vit_embeds.size(2)) + mag_norm = noise_alpha / torch.sqrt(dims) + noise = torch.zeros_like(vit_embeds).uniform_(-mag_norm, mag_norm) + return 
vit_embeds + noise + + def extract_feature(self, pixel_values): + if self.select_layer == -1: + vit_embeds = self.vision_model(pixel_values=pixel_values, output_hidden_states=False, return_dict=True).last_hidden_state + else: + vit_embeds = self.vision_model(pixel_values=pixel_values, output_hidden_states=True, return_dict=True).hidden_states[self.select_layer] + vit_embeds = vit_embeds[:, 1:, :] + + if self.training and self.neftune_alpha is not None: + vit_embeds = self.noised_embed(vit_embeds, self.neftune_alpha) + + h = w = int(vit_embeds.shape[1] ** 0.5) + vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1) + vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio) + vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, vit_embeds.shape[-1]) + vit_embeds = self.mlp1(vit_embeds) # .to(pixel_values.device) + return vit_embeds + + def multi_image_chat(self, tokenizer, pixel_values, image_counts, question, generation_config, history=None, return_history=False, IMG_START_TOKEN="", IMG_END_TOKEN="", IMG_CONTEXT_TOKEN=""): + img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN) + self.img_context_token_id = img_context_token_id + if tokenizer.convert_tokens_to_ids("<|im_end|>") != 0: + eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>") # 92542, InternLM2 + else: + eos_token_id = tokenizer.eos_token_id + + from internvl.conversation import get_conv_template + + template = get_conv_template(self.template) + + if history is None: + history = [] + image_tokens = "" + image_bs = pixel_values.shape[0] + # print(f"dynamic ViT batch size: {image_bs}, image_counts: {image_counts}") + for idx, image_count in enumerate(image_counts): + image_tokens += f" (图{idx+1}):" + IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_count + IMG_END_TOKEN + question = image_tokens + "\n" + question + else: + for old_question, old_answer in history: + template.append_message(template.roles[0], old_question) + template.append_message(template.roles[1], old_answer) + template.append_message(template.roles[0], question) + template.append_message(template.roles[1], None) + query = template.get_prompt() + model_inputs = tokenizer(query, return_tensors="pt") + input_ids = model_inputs["input_ids"].cuda() + attention_mask = model_inputs["attention_mask"].cuda() + generation_config["eos_token_id"] = eos_token_id + + generation_output = self.generate(pixel_values=pixel_values, input_ids=input_ids, attention_mask=attention_mask, **generation_config) + response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0] + response = response.split("<|im_end|>")[0].strip() # for InternLM2 + history.append((question, response)) + if return_history: + return response, history + else: + query_to_print = query.replace(image_tokens, "") + # print(query_to_print, response) + return response + return response + + @property + def tokenizer(self): + return self._tokenizer + + @property + def model(self): + # returns the model, unwrapping it if using Accelerate + if hasattr(self, "accelerator"): + return self.accelerator.unwrap_model(self._model) + else: + return self._model + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size + + def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None) -> List[int]: + """ """ + add_special_tokens = 
False if add_special_tokens is None else add_special_tokens + encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens) + # left-truncate the encoded context to be at most `left_truncate_len` tokens long + if left_truncate_len: + encoding = encoding[-left_truncate_len:] + return encoding + + def tok_decode(self, tokens): + try: + return self.tokenizer.decode(tokens) + except: + return self.tokenizer.decode([tokens]) + + def post_processing(self, response): + response = response.replace("\n", "").replace("不是", "No").replace("是", "Yes").replace("否", "No") + response = response.lower().replace("true", "yes").replace("false", "no") + pattern = re.compile(r"[\u4e00-\u9fa5]") + response = re.sub(pattern, "", response) + return response + + @torch.no_grad() + def generate( + self, + pixel_values: Optional[torch.FloatTensor] = None, + input_ids: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + visual_features: Optional[torch.FloatTensor] = None, + generation_config: Optional[GenerationConfig] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + **generate_kwargs, + ) -> torch.LongTensor: + assert self.img_context_token_id is not None + if pixel_values is not None: + if visual_features is not None: + vit_embeds = visual_features + else: + vit_embeds = self.extract_feature(pixel_values) + + input_embeds = self.language_model.get_input_embeddings()(input_ids) + B, N, C = input_embeds.shape + input_embeds = input_embeds.reshape(B * N, C) + + input_ids = input_ids.reshape(B * N) + selected = input_ids == self.img_context_token_id + assert selected.sum() != 0 + input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device) + + input_embeds = input_embeds.reshape(B, N, C) + else: + input_embeds = self.language_model.get_input_embeddings()(input_ids) + + outputs = self.language_model.generate( + inputs_embeds=input_embeds, + attention_mask=attention_mask, + generation_config=generation_config, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + use_cache=True, + **generate_kwargs, + ) + + return outputs + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def load_image(self, flattened_visuals, input_size=224): + assert flattened_visuals[0].mode == "RGB" + image = flattened_visuals[0].convert("RGB") + transform = build_transform(is_train=False, input_size=input_size) + if self.dynamic: + images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=self.use_thumbnail, max_num=self.max_num) + else: + images = [image] + pixel_values = [transform(image) for image in images] + pixel_values = torch.stack(pixel_values) + return pixel_values + + def generate_until(self, requests: List[Instance]) -> List[str]: + res = [] + + def _collate(x): + # the negative sign on len(toks) sorts descending - this has a few advantages: + # - time estimates will always be over not underestimates, which is more useful for planning + # - to know the size of a batch when going through the list, you know the first one is always the batch + # padded context length. this is useful to simplify the batching logic and more importantly to make + # automatic adaptive batches much much easier to implement + # - any OOMs will happen right away rather than near the end + toks = self.tok_encode(x[0]) + return -len(toks), x[0] + + # we group requests by their generation_kwargs, + # so that we don't try to execute e.g. 
greedy sampling and temp=0.8 sampling + # in the same batch. + re_ords = utils.Collator([reg.args for reg in requests], _collate, grouping=True) + chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None) + num_iters = len(requests) // self.batch_size if len(requests) % self.batch_size == 0 else len(requests) // self.batch_size + 1 + pbar = tqdm(total=num_iters, disable=(self.rank != 0), desc="Model Responding") + for chunk in chunks: + contexts, all_gen_kwargs, doc_to_visual, doc_id, task, split = zip(*chunk) + task = task[0] + split = split[0] + batched_visuals = [doc_to_visual[0](self.task_dict[task][split][ids]) for ids in doc_id] # [B, N] + flattened_visuals = self.flatten(batched_visuals) + pixel_values = self.load_image(flattened_visuals, self.image_size).cuda().to(torch.bfloat16) + gen_kwargs = all_gen_kwargs[0] + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + if "top_p" not in gen_kwargs: + gen_kwargs["top_p"] = None + if "num_beams" not in gen_kwargs: + gen_kwargs["num_beams"] = 1 + + generation_config = dict( + do_sample=False, + top_k=50, + top_p=gen_kwargs["top_p"], + num_beams=gen_kwargs["num_beams"], + max_new_tokens=gen_kwargs["max_new_tokens"], + eos_token_id=self.tokenizer.eos_token_id, + ) + question = contexts[0] + response = self.model.chat(tokenizer=self.tokenizer, pixel_values=pixel_values, question=question, generation_config=generation_config) + # TODO(choiszt) try batch_chat for multiple inputs + response = self.post_processing(response) + res.append(response) + self.cache_hook.add_partial("generate_until", (question, gen_kwargs), response) + pbar.update(1) + res = re_ords.get_original(res) + return res + # print(chunk) + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + pass diff --git a/lmms_eval/models/llama_vid.py b/lmms_eval/models/llama_vid.py new file mode 100644 index 00000000..69627fe8 --- /dev/null +++ b/lmms_eval/models/llama_vid.py @@ -0,0 +1,272 @@ +import logging +import os +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from typing import List, Optional, Union, Tuple +import torch +from tqdm import tqdm +from decord import VideoReader, cpu +import numpy as np +import math +from datetime import timedelta +from transformers import AutoConfig +from huggingface_hub import snapshot_download +import requests + +from lmms_eval import utils +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval.utils import stop_sequences_criteria +from lmms_eval.models.model_utils.load_video import read_video_pyav + +import subprocess + +eval_logger = logging.getLogger("lmms-eval") + +try: + from llamavid.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN + from llamavid.conversation import conv_templates, SeparatorStyle + from llamavid.model.builder import load_pretrained_model + from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria +except ImportError: + eval_logger.debug("LLaMA-Video is not installed. 
Please install LLaMA-Video to use this model.") + + +@register_model("llama_vid") +class LLaMAVid(lmms): + def __init__( + self, + pretrained: str = "YanweiLi/llama-vid-7b-full-224-video-fps-1", + truncation: Optional[bool] = True, + device: Optional[str] = "cuda:0", + dtype: Optional[Union[str, torch.dtype]] = "auto", + batch_size: Optional[Union[int, str]] = 1, + trust_remote_code: Optional[bool] = False, + revision=None, + attn_implementation=( + "sdpa" if torch.__version__ > "2.1.2" else "eager" + ), # inference implementation for attention, can be "sdpa", "eager", "flash_attention_2". Seems FA2 is not effective during inference: https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453/5 + device_map="cuda:0", + conv_template="vicuna_v1", + use_cache=True, + truncate_context=False, + num_frames: int = 100, + **kwargs, + ) -> None: + super().__init__() + + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + + self.pretrained = pretrained + self.model_path = snapshot_download(self.pretrained) + self.model_name = get_model_name_from_path(pretrained) + self.num_frames = num_frames + if not os.path.exists("./model_zoo/LAVIS/eva_vit_g.pth") and accelerator.is_main_process: + eval_logger.info("\n\n Eva Encoder is not found for LLaMA-VID. Download automatically to the folder ./model_zoo/LAVIS") + cache_path = "model_zoo/LAVIS" + os.makedirs(cache_path, exist_ok=True) + subprocess.run(["wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth -O ./model_zoo/LAVIS/eva_vit_g.pth"], shell=True) + + accelerator.wait_for_everyone() + self._tokenizer, self._model, self.image_processor, self._max_length = load_pretrained_model( + self.model_path, + None, + self.model_name, + device_map=self.device_map, + ) + + self._config = self._model.config + self.model.eval() + self.model.tie_weights() + self.truncation = truncation + self.batch_size_per_gpu = int(batch_size) + self.conv_template = conv_template + self.use_cache = use_cache + self.truncate_context = truncate_context + # assert self.batch_size_per_gpu == 1, "Llava currently does not support batched generation. See https://github.com/haotian-liu/LLaVA/issues/754. HF Llava also has this issue." + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. 
+ if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. Make sure you run `accelerate config` and set zero stage to 0") + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + elif accelerator.num_processes == 1 and device_map == "auto": + eval_logger.info(f"Using {accelerator.num_processes} devices with tensor parallelism") + self._rank = 0 + self._word_size = 1 + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + def download_file(self, url, folder_path): + # Create the folder if it doesn't exist + if not os.path.exists(folder_path): + os.makedirs(folder_path) + + # Extract filename from URL + filename = url.split("/")[-1] + + # Define path to save the file + file_path = os.path.join(folder_path, filename) + + # Send a GET request to the URL + response = requests.get(url) + + # Check if request was successful (status code 200) + if response.status_code == 200: + # Save the file to the specified folder + with open(file_path, "wb") as f: + f.write(response.content) + print(f"File downloaded successfully to {file_path}") + else: + print(f"Failed to download file. Status code: {response.status_code}") + + @property + def config(self): + # return the associated transformers.AutoConfig for the given pretrained model. 
+ return self._config + + @property + def tokenizer(self): + return self._tokenizer + + @property + def model(self): + # returns the model, unwrapping it if using Accelerate + if hasattr(self, "accelerator"): + return self.accelerator.unwrap_model(self._model) + else: + return self._model + + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + + @property + def max_length(self): + return self._max_length + + def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None) -> List[int]: + """ """ + add_special_tokens = False if add_special_tokens is None else add_special_tokens + encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens) + # left-truncate the encoded context to be at most `left_truncate_len` tokens long + if left_truncate_len: + encoding = encoding[-left_truncate_len:] + return encoding + + def tok_decode(self, tokens): + return self.tokenizer.decode(tokens) + + def load_video(self, video_path): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + fps = round(vr.get_avg_fps()) + frame_idx = [i for i in range(0, len(vr), fps)] + spare_frames = vr.get_batch(frame_idx).asnumpy() + return spare_frames + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + videos = [] + for visual in visuals: + video = read_video_pyav(visual, num_frm=self.num_frames) + video = self.image_processor.preprocess(video, return_tensors="pt")["pixel_values"].half().cuda() + video = [video] + videos += video + qs = contexts + if self.model.config.mm_use_im_start_end: + qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + qs + else: + qs = DEFAULT_IMAGE_TOKEN + "\n" + qs + + conv = conv_templates[self.conv_template].copy() + conv.append_message(conv.roles[0], qs) + conv.append_message(conv.roles[1], None) + prompt = conv.get_prompt() + + input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda() + + stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 + keywords = [stop_str] + stopping_criteria = KeywordsStoppingCriteria(keywords, self.tokenizer, input_ids) + + cur_prompt = contexts + with torch.inference_mode(): + self.model.update_prompt([[cur_prompt]]) + output_ids = self.model.generate(input_ids, images=video, do_sample=True, temperature=0.2, max_new_tokens=1024, use_cache=True, stopping_criteria=[stopping_criteria]) + + input_token_len = input_ids.shape[1] + n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item() + if n_diff_input_output > 0: + print(f"[Warning] {n_diff_input_output} output_ids are not the same as the input_ids") + outputs = self.tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0] + outputs = outputs.strip() + if outputs.endswith(stop_str): + outputs = outputs[: -len(stop_str)] + outputs = outputs.strip() + pbar.update(1) + res.append(outputs) + + return res + + def 
loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + return super().loglikelihood(requests) + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size diff --git a/lmms_eval/models/llava.py b/lmms_eval/models/llava.py old mode 100644 new mode 100755 index bd21bd33..b49cf55b --- a/lmms_eval/models/llava.py +++ b/lmms_eval/models/llava.py @@ -16,6 +16,7 @@ from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs from accelerate.state import AcceleratorState from typing import List, Optional, Union, Tuple +from packaging import version import warnings warnings.filterwarnings("ignore") @@ -25,12 +26,16 @@ try: from llava.model.builder import load_pretrained_model from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token - from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX - from llava.conversation import conv_templates, SeparatorStyle -except ImportError: - eval_logger.error("LLaVA is not installed. Please install LLaVA to use this model.") + from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN + from llava.conversation import conv_templates +except Exception as e: + eval_logger.debug("LLaVA is not installed. Please install LLaVA to use this model.\nError: %s" % e) -if torch.__version__ > "2.1.2": +# inference implementation for attention, can be "sdpa", "eager", "flash_attention_2". Seems FA2 is not effective during inference: https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453/5 +# if is_flash_attn_2_available: +# best_fit_attn_implementation = "flash_attention_2" # flash_attn has a bug that says: ERROR Error query and key must have the same dtype in generating + +if version.parse(torch.__version__) >= version.parse("2.1.2"): best_fit_attn_implementation = "sdpa" else: best_fit_attn_implementation = "eager" @@ -46,19 +51,15 @@ def __init__( self, pretrained: str = "liuhaotian/llava-v1.5-7b", truncation: Optional[bool] = True, - device: Optional[str] = "cuda", - dtype: Optional[Union[str, torch.dtype]] = "auto", + device: Optional[str] = "cuda:0", batch_size: Optional[Union[int, str]] = 1, - trust_remote_code: Optional[bool] = False, - revision=None, model_name=None, attn_implementation=best_fit_attn_implementation, - use_flash_attention_2=True, - device_map="auto", + device_map="cuda:0", conv_template="vicuna_v1", use_cache=True, truncate_context=False, # whether to truncate the context in generation, set it False for LLaVA-1.6 - customized_config=None, + customized_config=None, # ends in json **kwargs, ) -> None: super().__init__() @@ -67,32 +68,33 @@ def __init__( accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) - if accelerator.num_processes > 1 and device_map == "": + if accelerator.num_processes > 1: self._device = torch.device(f"cuda:{accelerator.local_process_index}") self.device_map = f"cuda:{accelerator.local_process_index}" - else: + elif accelerator.num_processes == 1 and device_map == "auto": self._device = torch.device(device) self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" - llava_model_args = {} 
- llava_model_args["attn_implementation"] = attn_implementation - if customized_config: + llava_model_args = { + "multimodal": True, + } + if customized_config is not None: llava_model_args["customized_config"] = customized_config if attn_implementation is not None: llava_model_args["attn_implementation"] = attn_implementation if "use_flash_attention_2" in kwargs: llava_model_args["use_flash_attention_2"] = kwargs["use_flash_attention_2"] - model_name = model_name if model_name is not None else get_model_name_from_path(pretrained) try: # Try to load the model with the multimodal argument self._tokenizer, self._model, self._image_processor, self._max_length = load_pretrained_model(pretrained, None, model_name, device_map=self.device_map, **llava_model_args) except TypeError: - # for older versions of LLaVA that don't have multimodal and attn_implementation arguments + # for older versions of LLaVA that don't have multimodal argument llava_model_args.pop("multimodal", None) - llava_model_args.pop("attn_implementation", None) self._tokenizer, self._model, self._image_processor, self._max_length = load_pretrained_model(pretrained, None, model_name, device_map=self.device_map, **llava_model_args) - self._config = self._model.config self.model.eval() self.model.tie_weights() @@ -102,7 +104,7 @@ def __init__( self.use_cache = use_cache self.truncate_context = truncate_context # assert self.batch_size_per_gpu == 1, "Llava currently does not support batched generation. See https://github.com/haotian-liu/LLaVA/issues/754. HF Llava also has this issue." - if accelerator.num_processes > 1 and device_map == "": + if accelerator.num_processes > 1: assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." 
# If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works @@ -194,7 +196,10 @@ def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=Non return encoding def tok_decode(self, tokens): - return self.tokenizer.decode(tokens) + try: + return self.tokenizer.decode(tokens) + except: + return self.tokenizer.decode([tokens]) def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: # TODO @@ -209,6 +214,7 @@ def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: continuation = doc_to_target(self.task_dict[task][split][doc_id]) visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] visuals = self.flatten(visuals) + image_sizes = [[visual.size[0], visual.size[1]] for visual in visuals] if visuals: image = process_images(visuals, self._image_processor, self._config) if type(image) is list: @@ -250,7 +256,7 @@ def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: # Context part no need to calculate for loss labels[0, : contxt_id.shape[1]] = -100 with torch.inference_mode(): - outputs = self.model(input_ids=input_ids, labels=labels, images=image, use_cache=True) + outputs = self.model(input_ids=input_ids, labels=labels, images=image, use_cache=True, image_sizes=image_sizes) loss = outputs["loss"] # loss = torch.exp(loss) logits = outputs["logits"] @@ -294,8 +300,8 @@ def _collate(x): contexts, all_gen_kwargs, doc_to_visual, doc_id, task, split = zip(*chunk) task = task[0] split = split[0] - visuals = [doc_to_visual[0](self.task_dict[task][split][ids]) for ids in doc_id] - visuals = self.flatten(visuals) + batched_visuals = [doc_to_visual[0](self.task_dict[task][split][ids]) for ids in doc_id] # [B, N] + flattened_visuals = self.flatten(batched_visuals) # [B*N] # we assume all gen kwargs in the batch are the same # this is safe to assume because the `grouper` object ensures it. gen_kwargs = all_gen_kwargs[0] @@ -316,8 +322,8 @@ def _collate(x): self._config.image_aspect_ratio = gen_kwargs.pop("image_aspect_ratio") eval_logger.info(f"Setting image aspect ratio: {self._config.image_aspect_ratio}") # encode, pad, and truncate contexts for this batch - if visuals: - image_tensor = process_images(visuals, self._image_processor, self._config) + if flattened_visuals: + image_tensor = process_images(flattened_visuals, self._image_processor, self._config) if type(image_tensor) is list: image_tensor = [_image.to(dtype=torch.float16, device=self.device) for _image in image_tensor] else: @@ -329,7 +335,7 @@ def _collate(x): question_input = [] - for visual, context in zip(visuals, contexts): + for visual, context in zip(batched_visuals, contexts): if image_tensor is not None and len(image_tensor) != 0 and DEFAULT_IMAGE_TOKEN not in context: """ Three senarios: @@ -342,7 +348,6 @@ def _collate(x): question = image_tokens + "\n" + context else: question = context - # This is much safer for llama3, as we now have some object type in it if "llama_3" in self.conv_template: conv = copy.deepcopy(conv_templates[self.conv_template]) @@ -356,7 +361,7 @@ def _collate(x): # The above for loop has bugs. When there is no visuals, e.g. 
pure text, # there will be no for loop execute resulting in an empty question_input (because no visuals) # Scenario 1 won't even be execute - if len(visuals) == 0: + if len(flattened_visuals) == 0: for context in contexts: question = context conv = conv_templates[self.conv_template].copy() @@ -367,7 +372,7 @@ def _collate(x): # input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(self.device) # preconfigure gen_kwargs with defaults - gen_kwargs["image_sizes"] = [visuals[idx].size for idx in range(len(visuals))] + gen_kwargs["image_sizes"] = [flattened_visuals[idx].size for idx in range(len(flattened_visuals))] if "max_new_tokens" not in gen_kwargs: gen_kwargs["max_new_tokens"] = 1024 if "temperature" not in gen_kwargs: @@ -382,7 +387,7 @@ def _collate(x): input_ids = self.pad_sequence(input_ids_list, batch_first=True, padding_value=pad_token_ids).to(self.device) attention_masks = input_ids.ne(pad_token_ids).to(self.device) # These steps are not in LLaVA's original code, but are necessary for generation to work - # TODO: pay attention to this major generation step... + # TODO: attention to this major generation step... try: cont = self.model.generate( input_ids, @@ -399,6 +404,7 @@ def _collate(x): ) text_outputs = self.tokenizer.batch_decode(cont, skip_special_tokens=True) except Exception as e: + raise e eval_logger.error(f"Error {e} in generating") cont = "" text_outputs = [""] diff --git a/lmms_eval/models/llava_sglang.py b/lmms_eval/models/llava_sglang.py index 01f23535..47c67288 100644 --- a/lmms_eval/models/llava_sglang.py +++ b/lmms_eval/models/llava_sglang.py @@ -1,4 +1,5 @@ import torch +import random torch.backends.cuda.matmul.allow_tf32 = True @@ -11,7 +12,6 @@ from lmms_eval.api.model import lmms from lmms_eval.api.registry import register_model -from accelerate import Accelerator, InitProcessGroupKwargs from typing import List, Optional, Union, Tuple import warnings @@ -25,7 +25,7 @@ import sglang as sgl from sglang.lang.chat_template import get_chat_template except ImportError: - eval_logger.error("SGLang is not installed. If you want to use llava_sglang, please install it using pip install 'sglang[all]' ") + eval_logger.debug("SGLang is not installed. If you want to use llava_sglang, please install it using pip install 'sglang[all]' ") if torch.__version__ > "2.1.2": best_fit_attn_implementation = "sdpa" @@ -53,11 +53,11 @@ def __init__( self.tokenizer = tokenizer self.tp_size = tp_size self.conv_template = conv_template - torch.multiprocessing.set_start_method("spawn") + # torch.multiprocessing.set_start_method("spawn") - accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) - accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) - assert accelerator.num_processes == 1, "Llava-sglang does not support multi-processes yet (it does support tensor parallelism)." + # accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + # accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + # assert accelerator.num_processes == 1, "Llava-sglang does not support multi-processes yet (it does support tensor parallelism)." 
self._rank = 0 self._world_size = 1 self.parallel = parallel @@ -66,8 +66,8 @@ def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: raise NotImplementedError("Llava-sglang does not support loglikelihood evaluation yet") def generate_until(self, requests: List[Instance]) -> List[str]: - - runtime = sgl.Runtime(model_path=self.pretrained, tokenizer_path=self.tokenizer, tp_size=self.tp_size) + torch.multiprocessing.set_start_method("spawn", force=True) + runtime = sgl.Runtime(model_path=self.pretrained, tokenizer_path=self.tokenizer, tp_size=self.tp_size, port=random.randint(10000, 50000)) runtime.endpoint.chat_template = get_chat_template(self.conv_template) sgl.set_default_backend(runtime) @@ -109,9 +109,6 @@ def _collate(x): gen_kwargs["top_p"] = 1.0 if "num_beams" not in gen_kwargs: gen_kwargs["num_beams"] = 1 - if gen_kwargs["top_p"] == 0.0: - gen_kwargs["top_p"] = 1.0 - gen_kwargs["temperature"] = 0.0 assert gen_kwargs["num_beams"] == 1 def save_image_to_temp_file(image): diff --git a/lmms_eval/models/llava_vid.py b/lmms_eval/models/llava_vid.py new file mode 100755 index 00000000..abd42c36 --- /dev/null +++ b/lmms_eval/models/llava_vid.py @@ -0,0 +1,419 @@ +import logging +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from typing import List, Optional, Union, Tuple +import torch +from tqdm import tqdm +from decord import VideoReader, cpu +import numpy as np +import math +from datetime import timedelta +from transformers import AutoConfig +import copy + +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval.models.model_utils.load_video import read_video_pyav + +eval_logger = logging.getLogger("lmms-eval") +import sys + +sys.path.append("llava-video") +try: + from llavavid.model.language_model.llava_llama import LlavaConfig + + # from llavavid.model.language_model.llava_qwen import LlavaQwenConfig + from llavavid.model.builder import load_pretrained_model + from llavavid.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria + from llavavid.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN + from llavavid.conversation import conv_templates, SeparatorStyle + + # AutoConfig.register("llava_qwen", LlavaQwenConfig) + AutoConfig.register("llava_llama", LlavaConfig) + +except ImportError: + eval_logger.debug("LLaVA-Video is not installed. Please install LLaVA-Video to use this model.") + +try: + from llavavid.model.language_model.llava_qwen import LlavaQwenConfig + + AutoConfig.register("llava_qwen", LlavaQwenConfig) +except: + eval_logger.debug("") + + +@register_model("llavavid") +class LlavaVid(lmms): + """ + LlavaVid Model + """ + + def __init__( + self, + pretrained: str = "liuhaotian/llava-v1.5-7b", + truncation: Optional[bool] = True, + device: Optional[str] = "cuda:0", + batch_size: Optional[Union[int, str]] = 1, + attn_implementation=( + "sdpa" if torch.__version__ >= "2.1.2" else "eager" + ), # inference implementation for attention, can be "sdpa", "eager", "flash_attention_2". 
Seems FA2 is not effective during inference: https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453/5 + device_map="cuda:0", + conv_template="vicuna_v1", + use_cache=True, + truncate_context=False, # whether to truncate the context in generation, set it False for LLaVA-1.6 + max_frames_num: int = 3, + mm_resampler_type: str = "spatial_pool", + mm_spatial_pool_stride: int = 2, + mm_spatial_pool_out_channels: int = 1024, + mm_spatial_pool_mode: str = "average", + overwrite: bool = True, + video_decode_backend: str = "pyav", + **kwargs, + ) -> None: + super().__init__() + assert kwargs == {}, f"Unexpected kwargs: {kwargs}" + + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + + self.pretrained = pretrained + self.model_name = get_model_name_from_path(pretrained) + self.video_decode_backend = video_decode_backend + # self._config = AutoConfig.from_pretrained(self.pretrained) + self.overwrite = overwrite + self.mm_resampler_type = mm_resampler_type + self.mm_spatial_pool_stride = int(mm_spatial_pool_stride) + self.mm_spatial_pool_out_channels = int(mm_spatial_pool_out_channels) + self.mm_spatial_pool_mode = mm_spatial_pool_mode + self.max_frames_num = int(max_frames_num) + if self.overwrite == True: + overwrite_config = {} + overwrite_config["mm_resampler_type"] = self.mm_resampler_type + overwrite_config["mm_spatial_pool_stride"] = self.mm_spatial_pool_stride + overwrite_config["mm_spatial_pool_out_channels"] = self.mm_spatial_pool_out_channels + overwrite_config["mm_spatial_pool_mode"] = self.mm_spatial_pool_mode + overwrite_config["mm_resampler_location"] = "before" + overwrite_config["patchify_video_feature"] = False + overwrite_config["attn_implementation"] = attn_implementation + + cfg_pretrained = AutoConfig.from_pretrained(self.pretrained) + + if cfg_pretrained.architectures[0] == "LlavaLlamaForCausalLM": # Ugly code, only used in vicuna that needs ROPE + if "224" in cfg_pretrained.mm_vision_tower: + least_token_number = self.max_frames_num * (16 // self.mm_spatial_pool_stride) ** 2 + 1000 + else: + least_token_number = self.max_frames_num * (24 // self.mm_spatial_pool_stride) ** 2 + 1000 + + scaling_factor = math.ceil(least_token_number / 4096) + if scaling_factor >= 2: + overwrite_config["rope_scaling"] = {"factor": float(scaling_factor), "type": "linear"} + overwrite_config["max_sequence_length"] = 4096 * scaling_factor + overwrite_config["tokenizer_model_max_length"] = 4096 * scaling_factor + + if "v1.5" in pretrained: # A hardcode solution here to load v1.5 model, otherwise it will use LlavaConfig from hf transformers + from transformers import AutoTokenizer + from llavavid.model.language_model.llava_llama import LlavaConfig, LlavaLlamaForCausalLM + + self._tokenizer = AutoTokenizer.from_pretrained(pretrained, use_fast=False) + cfg_pretrained = LlavaConfig.from_pretrained(pretrained) + if overwrite_config is not None: + print(f"Overwriting config with {overwrite_config}") + for k, v in overwrite_config.items(): + 
setattr(cfg_pretrained, k, v) + kwargs["torch_dtype"] = torch.float16 + self._model = LlavaLlamaForCausalLM.from_pretrained(pretrained, low_cpu_mem_usage=True, config=cfg_pretrained, device_map=self.device_map, **kwargs) + vision_tower = self._model.get_vision_tower() + if not vision_tower.is_loaded: + vision_tower.load_model(device_map=self.device_map) + if self.device_map != "auto": + vision_tower.to(device="cuda", dtype=torch.float16) + self._image_processor = vision_tower.image_processor + + if hasattr(self._model.config, "max_sequence_length"): + self._max_length = self._model.config.max_sequence_length + else: + self._max_length = 2048 + else: + self._tokenizer, self._model, self._image_processor, self._max_length = load_pretrained_model(pretrained, None, self.model_name, device_map=self.device_map, overwrite_config=overwrite_config) + else: + self._tokenizer, self._model, self._image_processor, self._max_length = load_pretrained_model( + pretrained, + None, + self.model_name, + device_map=self.device_map, + ) + + self._config = self._model.config + self.model.eval() + self.model.tie_weights() + self.truncation = truncation + self.batch_size_per_gpu = int(batch_size) + self.conv_template = conv_template + self.use_cache = use_cache + self.truncate_context = truncate_context + # assert self.batch_size_per_gpu == 1, "Llava currently does not support batched generation. See https://github.com/haotian-liu/LLaVA/issues/754. HF Llava also has this issue." + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. + if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. Make sure you run `accelerate config` and set zero stage to 0") + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + elif accelerator.num_processes == 1 and device_map == "auto": + eval_logger.info(f"Using {accelerator.num_processes} devices with tensor parallelism") + self._rank = 0 + self._word_size = 1 + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + @property + def config(self): + # return the associated transformers.AutoConfig for the given pretrained model. 
+ return self._config + + @property + def tokenizer(self): + return self._tokenizer + + @property + def model(self): + # returns the model, unwrapping it if using Accelerate + if hasattr(self, "accelerator"): + return self.accelerator.unwrap_model(self._model) + else: + return self._model + + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + + @property + def max_length(self): + return self._max_length + + def pad_sequence(self, input_ids, batch_first, padding_value): + if self.tokenizer.padding_side == "left": + input_ids = [torch.flip(_input_ids, [0]) for _input_ids in input_ids] + input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=batch_first, padding_value=padding_value) + if self.tokenizer.padding_side == "left": + input_ids = torch.flip(input_ids, [1]) + return input_ids + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size + + def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None) -> List[int]: + """ """ + add_special_tokens = False if add_special_tokens is None else add_special_tokens + encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens) + # left-truncate the encoded context to be at most `left_truncate_len` tokens long + if left_truncate_len: + encoding = encoding[-left_truncate_len:] + return encoding + + def load_video(self, video_path, max_frames_num): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + # fps = round(vr.get_avg_fps()) + # frame_idx = [i for i in range(0, len(vr), fps)] + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + spare_frames = vr.get_batch(frame_idx).asnumpy() + return spare_frames # (frames, height, width, channels) + + def tok_decode(self, tokens): + return self.tokenizer.decode(tokens) + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, doc_to_target, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + if type(doc_to_target) == str: + continuation = doc_to_target + else: + continuation = doc_to_target(self.task_dict[task][split][doc_id]) + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + videos = [] + for visual in visuals: + video = self.load_video(visual, self.max_frames_num) + video = self._image_processor.preprocess(video, return_tensors="pt")["pixel_values"].half().cuda() + videos.append(video) + + qs = contexts + if self.model.config.mm_use_im_start_end: + qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + qs + else: + qs = DEFAULT_IMAGE_TOKEN + "\n" + qs + + conv = conv_templates[self.conv_template].copy() + conv.append_message(conv.roles[0], qs) + conv.append_message(conv.roles[1], None) + prompt = conv.get_prompt() + + contxt_id = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(self.device) + + conv = conv_templates[self.conv_template].copy() + conv.append_message(conv.roles[0], qs) + 
conv.append_message(conv.roles[1], continuation) + prompt = conv.get_prompt() + + input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda() + attention_masks = input_ids.ne(self.tokenizer.pad_token_id).long().cuda() + + labels = input_ids.clone() + # Context part no need to calculate for loss + labels[0, : contxt_id.shape[1]] = -100 + + with torch.inference_mode(): + outputs = self.model(input_ids=input_ids, labels=labels, images=videos, modalities="video") + + loss = outputs["loss"] + # loss = torch.exp(loss) + logits = outputs["logits"] + greedy_tokens = logits.argmax(dim=-1) + cont_toks = input_ids[:, contxt_id.shape[1] :] # [1, seq] + greedy_tokens = greedy_tokens[:, contxt_id.shape[1] : input_ids.shape[1]] # [1, seq] + max_equal = (greedy_tokens == cont_toks).all() + res.append((float(loss.item()), bool(max_equal))) + pbar.update(1) + pbar.close() + return res + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + videos = [] + try: + for visual in visuals: + if self.video_decode_backend == "decord": + video = self.load_video(visual, self.max_frames_num) + elif self.video_decode_backend == "pyav": + video = read_video_pyav(visual, num_frm=self.max_frames_num) + # video = self.load_video(visual, self.max_frames_num) + video = self._image_processor.preprocess(video, return_tensors="pt")["pixel_values"].half().cuda() + videos.append(video) + except Exception as e: + eval_logger.info(f"{e}") + eval_logger.info(f"Video {visuals} can not load, check the source") + video_path = "\n".join(visuals) + res.append(f"Video {video_path} can not load, check the source") + pbar.update(1) + continue + + qs = contexts + if self.model.config.mm_use_im_start_end: + qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + qs + else: + qs = DEFAULT_IMAGE_TOKEN + "\n" + qs + + # This is much safer for llama3, as we now have some object type in it + if "llama_3" in self.conv_template: + conv = copy.deepcopy(conv_templates[self.conv_template]) + else: + conv = conv_templates[self.conv_template].copy() + + conv.append_message(conv.roles[0], qs) + conv.append_message(conv.roles[1], None) + prompt = conv.get_prompt() + + input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda() + pad_token_ids = self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else self.tokenizer.eos_token_id + if "llama_3" in self.conv_template: + pad_token_ids = 0 # lmms-lab/llama3-llava-8b is trained on this pad token id. You may need to customize this for other models. 
+ attention_masks = input_ids.ne(pad_token_ids).long().cuda() + + # input_ids_list = [tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt") for prompt in question_input] + # pad_token_ids = self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else self.tokenizer.eos_token_id + # input_ids = self.pad_sequence(input_ids_list, batch_first=True, padding_value=pad_token_ids).to(self.device) + # attention_masks = input_ids.ne(pad_token_ids).to(self.device) + + stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 + keywords = [stop_str] + stopping_criteria = KeywordsStoppingCriteria(keywords, self.tokenizer, input_ids) + + cur_prompt = contexts + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0.2 + if "top_p" not in gen_kwargs: + gen_kwargs["top_p"] = None + if "num_beams" not in gen_kwargs: + gen_kwargs["num_beams"] = 1 + with torch.inference_mode(): + output_ids = self.model.generate( + inputs=input_ids, + images=videos, + attention_mask=attention_masks, + modalities="video", + use_cache=self.use_cache, + stopping_criteria=[stopping_criteria], + do_sample=True if gen_kwargs["temperature"] > 0 else False, + temperature=gen_kwargs["temperature"], + top_p=gen_kwargs["top_p"], + num_beams=gen_kwargs["num_beams"], + max_new_tokens=gen_kwargs["max_new_tokens"], + ) + # output_ids = model.generate(inputs=input_ids, images=video, attention_mask=attention_masks, modalities="video", do_sample=True, temperature=0.2, use_cache=True, stopping_criteria=[stopping_criteria]) + + outputs = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() + res.append(outputs) + pbar.update(1) + return res diff --git a/lmms_eval/models/minicpm_v.py b/lmms_eval/models/minicpm_v.py old mode 100644 new mode 100755 diff --git a/lmms_eval/models/model_utils/__init__.py b/lmms_eval/models/model_utils/__init__.py old mode 100644 new mode 100755 diff --git a/lmms_eval/models/model_utils/load_video.py b/lmms_eval/models/model_utils/load_video.py new file mode 100644 index 00000000..789039e7 --- /dev/null +++ b/lmms_eval/models/model_utils/load_video.py @@ -0,0 +1,55 @@ +import av +from av.codec.context import CodecContext +import numpy as np + + +# This one is faster +def record_video_length_stream(container, indices): + frames = [] + start_index = indices[0] + end_index = indices[-1] + for i, frame in enumerate(container.decode(video=0)): + if i > end_index: + break + if i >= start_index and i in indices: + frames.append(frame) + return frames + + +# This one works for all types of video +def record_video_length_packet(container): + frames = [] + # https://github.com/PyAV-Org/PyAV/issues/1269 + # https://www.cnblogs.com/beyond-tester/p/17641872.html + # context = CodecContext.create("libvpx-vp9", "r") + for packet in container.demux(video=0): + for frame in packet.decode(): + frames.append(frame) + return frames + + +def read_video_pyav(video_path, num_frm=8): + + if "webm" not in video_path and "mkv" not in video_path: + # For mp4, we try loading with stream first + try: + container = av.open(video_path) + total_frames = container.streams.video[0].frames + sampled_frm = min(total_frames, num_frm) + indices = np.linspace(0, total_frames - 1, sampled_frm, dtype=int) + frames = record_video_length_stream(container, indices) + except: + container = av.open(video_path) + frames = record_video_length_packet(container) + total_frames = 
len(frames) + sampled_frm = min(total_frames, num_frm) + indices = np.linspace(0, total_frames - 1, sampled_frm, dtype=int) + frames = [frames[i] for i in indices] + else: + container = av.open(video_path) + frames = record_video_length_packet(container) + total_frames = len(frames) + sampled_frm = min(total_frames, num_frm) + indices = np.linspace(0, total_frames - 1, sampled_frm, dtype=int) + frames = [frames[i] for i in indices] + return np.stack([x.to_ndarray(format="rgb24") for x in frames]) diff --git a/lmms_eval/models/model_utils/qwen/qwen_generate_utils.py b/lmms_eval/models/model_utils/qwen/qwen_generate_utils.py old mode 100644 new mode 100755 diff --git a/lmms_eval/models/mplug_owl_video.py b/lmms_eval/models/mplug_owl_video.py new file mode 100644 index 00000000..bfc52d23 --- /dev/null +++ b/lmms_eval/models/mplug_owl_video.py @@ -0,0 +1,194 @@ +import logging +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from typing import List, Optional, Union, Tuple +import torch +from transformers import AutoTokenizer +from tqdm import tqdm +from datetime import timedelta + +from lmms_eval import utils +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval.utils import stop_sequences_criteria + +from lmms_eval.models.mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration +from lmms_eval.models.mplug_owl_video.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor + + +eval_logger = logging.getLogger("lmms-eval") + + +@register_model("mplug_owl_video") +class mplug_Owl(lmms): + def __init__( + self, + pretrained: str = "MAGAer13/mplug-owl-llama-7b-video", + device: Optional[str] = "cuda:0", + dtype: Optional[Union[str, torch.dtype]] = "auto", + batch_size: Optional[Union[int, str]] = 1, + device_map="cuda:0", + num_frames: Union[str, int] = 4, + **kwargs, + ) -> None: + """ + Install instructions: + 1. Install lmms-eval + cd lmms-eval + pip install -e .; + 2. Install other packages with restricted versions + pip install av sentencepiece protobuf==3.20 transformers==4.28.1 einops; + """ + super().__init__() + + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + + # import pdb; pdb.set_trace() + # This is very slow. 
Their issue, not mine + # Also, keep transformers in version 4.28.1 + # They put a Config object inside a config object, this is not acceptable + # for transformers == 4.39.1, object type not serializable + # Protobuf needs to be in 3.20.x otherwise error + # ヽ(`Д´)ノ + self._model = MplugOwlForConditionalGeneration.from_pretrained( + pretrained, + torch_dtype=torch.bfloat16, + ) + self.image_processor = MplugOwlImageProcessor.from_pretrained(pretrained) + self._tokenizer = AutoTokenizer.from_pretrained(pretrained) + self.processor = MplugOwlProcessor(self.image_processor, self.tokenizer) + self.model.eval() + self.batch_size_per_gpu = batch_size + self.num_frames = num_frames + + self.model.to(self.device) + + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. + if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. Make sure you run `accelerate config` and set zero stage to 0") + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + @property + def config(self): + # return the associated transformers.AutoConfig for the given pretrained model. 
+ return self._config + + @property + def tokenizer(self): + return self._tokenizer + + @property + def model(self): + # returns the model, unwrapping it if using Accelerate + if hasattr(self, "accelerator"): + return self.accelerator.unwrap_model(self._model) + else: + return self._model + + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + + @property + def max_length(self): + return self._max_length + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def format_prompt(self, question): + prompts = [f" <|video|> Question : {question} Answer : "] + return prompts + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + inputs = self.processor(text=self.format_prompt(contexts), videos=visuals, num_frames=self.num_frames, return_tensors="pt") + pixel_values_videos = inputs["video_pixel_values"] + if pixel_values_videos.shape[2] != self.num_frames: + empty_frames = torch.zeros((1, pixel_values_videos.shape[1], self.num_frames - pixel_values_videos.shape[2], *pixel_values_videos.shape[3:]), dtype=pixel_values_videos.dtype) + pixel_values_videos = torch.cat([pixel_values_videos, empty_frames], dim=2) + inputs["video_pixel_values"] = pixel_values_videos + inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()} + inputs = {k: v.to(self.model.device) for k, v in inputs.items()} + + if "max_new_tokens" in gen_kwargs: + gen_kwargs["max_length"] = gen_kwargs["max_new_tokens"] + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_length"] = 128 + if "do_sample" not in gen_kwargs: + gen_kwargs["do_sample"] = False + if "top_k" not in gen_kwargs: + gen_kwargs["top_k"] = 1 + + generate_kwargs = {"do_sample": gen_kwargs["do_sample"], "top_k": gen_kwargs["top_k"], "max_length": gen_kwargs["max_length"]} + + with torch.no_grad(): + outputs = self.model.generate(**inputs, **generate_kwargs) + sentence = self.tokenizer.decode(outputs.tolist()[0], skip_special_tokens=True) + pbar.update(1) + res.append(sentence) + pbar.close() + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + return super().loglikelihood(requests) diff --git a/lmms_eval/models/mplug_owl_video/__init__.py b/lmms_eval/models/mplug_owl_video/__init__.py new file mode 100644 index 00000000..2020ad3a --- /dev/null +++ b/lmms_eval/models/mplug_owl_video/__init__.py @@ -0,0 +1,77 @@ +# Copyright 2020 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import TYPE_CHECKING + +from transformers.utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available + + +_import_structure = { + "configuration_mplug_owl": ["MPLUG_OWL_PRETRAINED_CONFIG_ARCHIVE_MAP", "MplugOwlConfig"], + "processing_mplug_owl": ["MplugOwlImageProcessor", "MplugOwlProcessor"], + "tokenization_mplug_owl": ["MplugOwlTokenizer"], +} + +try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass + + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_mplug_owl"] = [ + "MPLUG_OWL_PRETRAINED_MODEL_ARCHIVE_LIST", + "MplugOwlForConditionalGeneration", + "MplugOwlModel", + ] + + +if TYPE_CHECKING: + from .configuration_mplug_owl import MPLUG_OWL_PRETRAINED_CONFIG_ARCHIVE_MAP, MplugOwlConfig + from .tokenization_mplug_owl import MplugOwlTokenizer + + try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_mplug_owl import ( + MPLUG_OWL_PRETRAINED_MODEL_ARCHIVE_LIST, + MplugOwlForConditionalGeneration, + MplugOwlModel, + MplugOwlPreTrainedModel, + ) + + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) + +from .configuration_mplug_owl import * +from .modeling_mplug_owl import * +from .processing_mplug_owl import * +from .tokenization_mplug_owl import * diff --git a/lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py b/lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py new file mode 100644 index 00000000..6b5d458d --- /dev/null +++ b/lmms_eval/models/mplug_owl_video/configuration_mplug_owl.py @@ -0,0 +1,289 @@ +# coding=utf-8 +# Copyright 2022 x-plug and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" MplugOwl model configuration """ +import copy +import os +from typing import Union + +from transformers.configuration_utils import PretrainedConfig +from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES +from transformers.utils import logging +from transformers.models.auto import CONFIG_MAPPING + + +logger = logging.get_logger(__name__) + +MPLUG_OWL_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "MAGAer13/mplug-owl-llama-7b": "https://huggingface.co/MAGAer13/mplug-owl-llama-7b/resolve/main/config.json", + # See all MplugOwl models at https://huggingface.co/models?filter=mplug_owl +} + + +class MplugOwlVisionConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`MplugOwlVisionModel`]. It is used to instantiate a + mPLUG-Owl vision encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration defaults will yield a similar configuration to that of the mPLUG-Owl + [x-plug/x_plug-llama-7b](https://huggingface.co/x-plug/x_plug-llama-7b) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int`, *optional*, defaults to 32): + The size (resolution) of each patch. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-5): + The epsilon used by the layer normalization layers. + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). 
+ + + ```""" + + model_type = "mplug_owl_vision_model" + + def __init__( + self, + hidden_size=1024, + intermediate_size=4096, + projection_dim=768, + num_hidden_layers=24, + num_attention_heads=16, + num_channels=3, + image_size=224, + patch_size=14, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + attention_dropout=0.0, + initializer_range=0.02, + initializer_factor=1.0, + use_flash_attn=False, + **kwargs, + ): + super().__init__(**kwargs) + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_channels = num_channels + self.patch_size = patch_size + self.image_size = image_size + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.attention_dropout = attention_dropout + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.use_flash_attn = use_flash_attn + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the vision config dict if we are loading from MplugOwlConfig + if config_dict.get("model_type") == "mplug-owl": + config_dict = config_dict["vision_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning(f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " f"{cls.model_type}. This is not supported for all configurations of models and can yield errors.") + + return cls.from_dict(config_dict, **kwargs) + + +class MplugOwlVisualAbstractorConfig(PretrainedConfig): + model_type = "mplug_owl_visual_abstract" + + def __init__( + self, + hidden_size=1024, # + num_hidden_layers=6, # + num_attention_heads=16, # + intermediate_size=4096, # + attention_probs_dropout_prob=0.1, # + initializer_range=0.02, + layer_norm_eps=1e-6, # + encoder_hidden_size=1024, # + **kwargs, + ): + super().__init__(**kwargs) + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.encoder_hidden_size = encoder_hidden_size + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the visual_abstractor config dict if we are loading from MplugOwlConfig + if config_dict.get("model_type") == "mplug-owl": + config_dict = config_dict["abstractor_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning(f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " f"{cls.model_type}. This is not supported for all configurations of models and can yield errors.") + + return cls.from_dict(config_dict, **kwargs) + + +class MplugOwlConfig(PretrainedConfig): + r""" + [`MplugOwlConfig`] is the configuration class to store the configuration of a [`MplugOwlForConditionalGeneration`]. 
It is + used to instantiate a mPLUG-Owl model according to the specified arguments, defining the vision model, Q-Former model + and language model configs. Instantiating a configuration with the defaults will yield a similar configuration to + that of the mPLUG-Owl [x-plug/x_plug-llama-7b](https://huggingface.co/x-plug/x_plug-llama-7b) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + vision_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`MplugOwlVisionConfig`]. + visual_abstractor_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`MplugOwlVisualAbstractorConfig`]. + text_config (`dict`, *optional*): + Dictionary of configuration options used to initialize any [`PretrainedConfig`]. + num_query_tokens (`int`, *optional*, defaults to 32): + The number of query tokens passed through the Transformer. + + kwargs (*optional*): + Dictionary of keyword arguments. + + Example: + + ```python + >>> from transformers import ( + ... MplugOwlVisionConfig, + ... MplugOwlVisualAbstractorConfig, + ... OPTConfig, + ... MplugOwlConfig, + ... MplugOwlForConditionalGeneration, + ... ) + + >>> # Initializing a MplugOwlConfig with x-plug/x_plug-llama-7b style configuration + >>> configuration = MplugOwlConfig() + + >>> # Initializing a MplugOwlForConditionalGeneration (with random weights) from the x-plug/x_plug-llama-7b style configuration + >>> model = MplugOwlForConditionalGeneration(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + + >>> # We can also initialize a MplugOwlConfig from a MplugOwlVisionConfig, MplugOwlVisualAbstractorConfig and any PretrainedConfig + + >>> # Initializing mPLUG-Owl vision, mPLUG-Owl Q-Former and language model configurations + >>> vision_config = MplugOwlVisionConfig() + >>> visual_abstractor_config = MplugOwlVisualAbstractorConfig() + >>> text_config = OPTConfig() + + >>> config = MplugOwlConfig.from_text_vision_configs(vision_config, visual_abstractor_config, text_config) + ```""" + + model_type = "mplug-owl" + is_composition = True + + def __init__(self, vision_config=None, visual_abstractor_config=None, text_config=None, num_query_tokens=64, **kwargs): + super().__init__(**kwargs) + if vision_config is None: + vision_config = MplugOwlVisionConfig().to_dict() + logger.info("vision_config is None.") + + if visual_abstractor_config is None: + visual_abstractor_config = {} + logger.info("abstractor_config is None. 
") + + if text_config is None: + # we use LLAMA 7b by default + from ..llama.configuration_llama import LlamaConfig + + text_config = LlamaConfig(pad_token_id=2).to_dict() + logger.info("text_config is None.") + + self.vision_config = MplugOwlVisionConfig(**vision_config) + self.visual_abstractor_config = MplugOwlVisualAbstractorConfig(**visual_abstractor_config) + # self.visual_abstractor_config.layer_norm_eps = 1e-6 + text_model_type = text_config["model_type"] if "model_type" in text_config else "llama" + self.text_config = CONFIG_MAPPING[text_model_type](**text_config) + + self.tie_word_embeddings = self.text_config.tie_word_embeddings + self.is_encoder_decoder = self.text_config.is_encoder_decoder + + self.num_query_tokens = num_query_tokens + # self.visual_abstractor_config.encoder_hidden_size = self.vision_config.hidden_size + self.use_decoder_only_language_model = self.text_config.model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES + self.initializer_factor = 1.0 + self.initializer_range = 0.02 + + for attr in dir(self.text_config): + if not hasattr(self, attr): + setattr(self, attr, getattr(self.text_config, attr)) + + @classmethod + def from_vision_visual_abstractor_text_configs( + cls, + vision_config: MplugOwlVisionConfig, + visual_abstractor_config: MplugOwlVisualAbstractorConfig, + text_config: PretrainedConfig, + **kwargs, + ): + r""" + Instantiate a [`MplugOwlConfig`] (or a derived class) from a mPLUG-Owl vision model, Q-Former and language model + configurations. + + Returns: + [`MplugOwlConfig`]: An instance of a configuration object + """ + + return cls( + vision_config=vision_config.to_dict(), + visual_abstractor_config=visual_abstractor_config.to_dict(), + text_config=text_config.to_dict(), + **kwargs, + ) + + def to_dict(self): + """ + Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. + + Returns: + `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, + """ + output = copy.deepcopy(self.__dict__) + output["vision_config"] = self.vision_config.to_dict() + output["visual_abstractor_config"] = self.visual_abstractor_config.to_dict() + output["text_config"] = self.text_config.to_dict() + output["model_type"] = self.__class__.model_type + return output diff --git a/lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py b/lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py new file mode 100644 index 00000000..6c5b7592 --- /dev/null +++ b/lmms_eval/models/mplug_owl_video/modeling_mplug_owl.py @@ -0,0 +1,1841 @@ +# coding=utf-8 +# Copyright 2022 x-plug The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch MplugOwl model. """ + +import logging +import math +from typing import Any, Optional, Tuple, Union + +try: + from flash_attn.flash_attn_interface import flash_attn_unpadded_func + + flash_attn_func = flash_attn_unpadded_func +except: + flash_attn_func = None + print("Error importing flash_attn in mplug_owl. 
Please install flash-attn first.") +import math +from dataclasses import dataclass +from typing import Any, Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn +import einops + +from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, BaseModelOutputWithPastAndCrossAttentions +from transformers.modeling_utils import PreTrainedModel +from transformers.pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer +from transformers.utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from transformers.models.auto import AutoModelForCausalLM +from .configuration_mplug_owl import MplugOwlConfig, MplugOwlVisionConfig, MplugOwlVisualAbstractorConfig + + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "MAGAer13/mplug-owl-llama-7b" +_CONFIG_FOR_DOC = "MplugOwlConfig" + + +MPLUG_OWL_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "MAGAer13/mplug-owl-llama-7b", + # See all MplugOwl models at https://huggingface.co/models?filter=mplug_owl +] + + +@dataclass +class MplugOwlForConditionalGenerationModelOutput(ModelOutput): + """ + Class defining the outputs of [`MPlugOwlForConditionalGeneration`]. + + Args: + loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): + Language modeling loss from the language model. + logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): + Prediction scores of the language modeling head of the language model. + vision_outputs (`BaseModelOutputWithPooling`): + Outputs of the vision encoder. + + language_model_outputs (`CausalLMOutputWithPast` or `Seq2SeqLMOutput`): + Outputs of the language model. + """ + + loss: Optional[Tuple[torch.FloatTensor]] = None + logits: Optional[Tuple[torch.FloatTensor]] = None + vision_outputs: Optional[torch.FloatTensor] = None + language_model_outputs: Optional[Tuple[torch.FloatTensor]] = None + + def to_tuple(self) -> Tuple[Any]: + return tuple(self[k] if k not in ["vision_outputs", "language_model_outputs"] else getattr(self, k).to_tuple() for k in self.keys()) + + +def get_ltor_masks_and_position_ids_from_embeddings(data): + """Build masks and position id for left to right model.""" + + # Extract batch size and sequence length. + micro_batch_size, seq_length = data.size()[:2] + + # Attention mask (lower triangular). + att_mask_batch = 1 + attention_mask = torch.tril(torch.ones((att_mask_batch, seq_length, seq_length), device=data.device)).view(att_mask_batch, 1, seq_length, seq_length) + + # Loss mask. + loss_mask = torch.ones(data.size()[:2], dtype=torch.float, device=data.device) + + # Position ids. 
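+    # (Editor's note, illustrative, not in the upstream mPLUG-Owl file: with micro_batch_size=2 and
+    # seq_length=4 the ids built below come out as tensor([[0, 1, 2, 3], [0, 1, 2, 3]]), and the `< 0.5`
+    # comparison further down turns the lower-triangular float mask into a boolean mask that is
+    # True only on future, i.e. masked, positions.)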
+ position_ids = torch.arange(seq_length, dtype=torch.long, device=data.device) + position_ids = position_ids.unsqueeze(0).expand_as(data[..., 0]) + + # Convert attention mask to binary: + attention_mask = attention_mask < 0.5 + + return attention_mask, loss_mask, position_ids + + +class MplugOwlVisionEmbeddings(nn.Module): + def __init__(self, config: MplugOwlVisionConfig): + super().__init__() + self.config = config + self.hidden_size = config.hidden_size + self.image_size = config.image_size + self.patch_size = config.patch_size + + self.cls_token = nn.Parameter(torch.randn(1, 1, self.hidden_size)) + + self.patch_embed = nn.Conv2d( + in_channels=3, + out_channels=self.hidden_size, + kernel_size=self.patch_size, + stride=self.patch_size, + bias=False, + ) + + self.num_patches = (self.image_size // self.patch_size) ** 2 + + self.position_embedding = nn.Parameter(torch.randn(1, self.num_patches + 1, self.hidden_size)) + + self.pre_layernorm = LayerNormFp32(self.hidden_size, eps=config.layer_norm_eps) + + def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor: + # [B, C, T, H, W] or [B, C, H, W] + batch_size = pixel_values.size(0) + T = pixel_values.size(2) if pixel_values.dim() > 4 else 1 + if T > 1: + pixel_values = einops.rearrange(pixel_values, "b c t h w -> (b t) c h w") + image_embeds = self.patch_embed(pixel_values) + image_embeds = image_embeds.flatten(2).transpose(1, 2) + + class_embeds = self.cls_token.expand(batch_size * T, 1, -1).to(image_embeds.dtype) + embeddings = torch.cat([class_embeds, image_embeds], dim=1) + embeddings = embeddings + self.position_embedding[:, : embeddings.size(1)].to(image_embeds.dtype) + embeddings = self.pre_layernorm(embeddings) + embeddings = einops.rearrange(embeddings, "(b t) n d -> b t n d", b=batch_size) + return embeddings + + +class LayerNormFp32(nn.LayerNorm): + """Subclass torch's LayerNorm to handle fp16 (by casting to float32 and back).""" + + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def forward(self, x: torch.Tensor): + output = torch.nn.functional.layer_norm( + x.float(), + self.normalized_shape, + self.weight.float() if self.weight is not None else None, + self.bias.float() if self.bias is not None else None, + self.eps, + ) + return output.type_as(x) + + +class QuickGELU(nn.Module): + def forward(self, x: torch.Tensor): + return x * torch.sigmoid(1.702 * x) + + +class MplugOwlVisionLocalTemporal(nn.Module): + def __init__(self, config): + super(MplugOwlVisionLocalTemporal, self).__init__() + + self.image_size = config.image_size + self.patch_size = config.patch_size + self.num_patches = 1 + (self.image_size // self.patch_size) ** 2 + self.hidden_size = config.hidden_size + d_bottleneck = self.hidden_size // 2 + + self.ln = LayerNormFp32(self.hidden_size) + self.down_proj = nn.Conv3d(self.hidden_size, d_bottleneck, kernel_size=1, stride=1, padding=0) + self.conv = nn.Conv3d(d_bottleneck, d_bottleneck, kernel_size=(3, 1, 1), stride=1, padding=(1, 0, 0), groups=d_bottleneck) + self.up_proj = nn.Conv3d(d_bottleneck, self.hidden_size, kernel_size=1, stride=1, padding=0) + + nn.init.constant_(self.up_proj.weight, 0) + nn.init.constant_(self.up_proj.bias, 0) + + self.activation_func = QuickGELU() + + def forward(self, x): + # [b, t, s, c] + T = x.size(1) + H = int((self.num_patches - 1) ** 0.5) + cls_token, x = x[:, :, 0:1], x[:, :, 1:] + x = self.ln(x) + x = einops.rearrange(x, "b t (h w) c -> b c t h w", h=H) + x = self.down_proj(x) + if self.conv.weight.dtype == torch.bfloat16: + x = 
torch.nn.functional.conv3d(x.half(), self.conv.weight.half(), bias=self.conv.bias.half(), stride=1, padding=(1, 0, 0), groups=self.conv.weight.shape[0]).to(cls_token.dtype) + else: + x = self.conv(x) + x = self.activation_func(x) + x = self.up_proj(x) + x = einops.rearrange(x, "b c t h w -> b t (h w) c") + x = torch.cat([cls_token, x], dim=2) + return x + + +class MplugOwlVisionAttention(nn.Module): + """Multi-headed attention from 'Attention Is All You Need' paper""" + + def __init__(self, config): + super().__init__() + self.config = config + self.hidden_size = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.hidden_size // self.num_heads + if self.head_dim * self.num_heads != self.hidden_size: + raise ValueError(f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size} and `num_heads`:" f" {self.num_heads}).") + self.scale = self.head_dim**-0.5 + self.dropout = nn.Dropout(config.attention_dropout) + + self.query_key_value = nn.Linear(self.hidden_size, 3 * self.hidden_size) + self.dense = nn.Linear(self.hidden_size, self.hidden_size) + + def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): + return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() + + def forward( + self, + hidden_states: torch.Tensor, + head_mask: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + """Input shape: Batch x Time x Channel""" + + bsz, seq_len, embed_dim = hidden_states.size() + + mixed_qkv = self.query_key_value(hidden_states) + + mixed_qkv = mixed_qkv.reshape(bsz, seq_len, self.num_heads, 3, embed_dim // self.num_heads).permute(3, 0, 2, 1, 4) # [3, b, np, sq, hn] + query_states, key_states, value_states = ( + mixed_qkv[0], + mixed_qkv[1], + mixed_qkv[2], + ) + # if self.config.use_flash_attn and flash_attn_func is not None: + if False: + # [b*sq, np, hn] + query_states = query_states.permute(0, 2, 1, 3).contiguous() + query_states = query_states.view(query_states.size(0) * query_states.size(1), query_states.size(2), -1) + + key_states = key_states.permute(0, 2, 1, 3).contiguous() + key_states = key_states.view(key_states.size(0) * key_states.size(1), key_states.size(2), -1) + + value_states = value_states.permute(0, 2, 1, 3).contiguous() + value_states = value_states.view(value_states.size(0) * value_states.size(1), value_states.size(2), -1) + + cu_seqlens = torch.arange(0, (bsz + 1) * seq_len, step=seq_len, dtype=torch.int32, device=query_states.device) + + context_layer = flash_attn_func( + query_states, + key_states, + value_states, + cu_seqlens, + cu_seqlens, + seq_len, + seq_len, + self.dropout if self.training else 0.0, + softmax_scale=self.scale, + causal=False, + return_attn_probs=False, + ) + # [b*sq, np, hn] => [b, sq, np, hn] + context_layer = context_layer.view(bsz, seq_len, context_layer.size(1), context_layer.size(2)) + else: + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2)) + + attention_scores = attention_scores * self.scale + + # Normalize the attention scores to probabilities. + attention_probs = torch.softmax(attention_scores, dim=-1) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. 
+ attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_states).permute(0, 2, 1, 3) + + new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size,) + context_layer = context_layer.reshape(new_context_layer_shape) + + output = self.dense(context_layer) + + outputs = (output, attention_probs) if output_attentions else (output, None) + + return outputs + + +class MplugOwlMLP(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.activation_fn = QuickGELU() + self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size) + self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size) + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.fc1(hidden_states) + hidden_states = self.activation_fn(hidden_states) + hidden_states = self.fc2(hidden_states) + return hidden_states + + +class MplugOwlVisionEncoderLayer(nn.Module): + def __init__(self, config: MplugOwlVisionConfig): + super().__init__() + self.hidden_size = config.hidden_size + self.temporal = MplugOwlVisionLocalTemporal(config) + self.self_attn = MplugOwlVisionAttention(config) + self.input_layernorm = LayerNormFp32(self.hidden_size, eps=config.layer_norm_eps) + self.mlp = MplugOwlMLP(config) + self.post_attention_layernorm = LayerNormFp32(self.hidden_size, eps=config.layer_norm_eps) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: torch.Tensor, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.FloatTensor]: + """ + Args: + hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, time, seq_len, embed_dim)` + attention_mask (`torch.FloatTensor`): attention mask of size + `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + `(config.encoder_attention_heads,)`. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + """ + B, T = hidden_states.size(0), hidden_states.size(1) + if T > 1: + hidden_states = hidden_states + self.temporal(hidden_states) + hidden_states = einops.rearrange(hidden_states, "b t n d -> (b t) n d") + + residual = hidden_states + + hidden_states = self.input_layernorm(hidden_states) + hidden_states, attn_weights = self.self_attn( + hidden_states=hidden_states, + head_mask=attention_mask, + output_attentions=output_attentions, + ) + hidden_states = hidden_states + residual + residual = hidden_states + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = self.mlp(hidden_states) + + hidden_states = hidden_states + residual + hidden_states = einops.rearrange(hidden_states, "(b t) n d -> b t n d", b=B) + + outputs = (hidden_states,) + + if output_attentions: + outputs += (attn_weights,) + + return outputs + + +class MplugOwlPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
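+    (Editor's aside, not part of the upstream docstring: the `MplugOwlVisionLocalTemporal` module defined
+    above zero-initializes its output projection `up_proj`, so the newly added temporal convolution
+    contributes nothing to the patch tokens at the start of training. A minimal sketch of that zero-init
+    residual-adapter pattern, using hypothetical plain linear layers rather than the module's 3D convolutions:)
+
+    ```python
+    import torch
+    from torch import nn
+
+    class ZeroInitAdapter(nn.Module):
+        """Bottleneck branch whose last projection starts at zero (cf. MplugOwlVisionLocalTemporal.up_proj)."""
+
+        def __init__(self, dim: int):
+            super().__init__()
+            self.down = nn.Linear(dim, dim // 2)
+            self.up = nn.Linear(dim // 2, dim)
+            nn.init.zeros_(self.up.weight)
+            nn.init.zeros_(self.up.bias)
+
+        def forward(self, x):
+            return self.up(torch.nn.functional.gelu(self.down(x)))
+
+    x = torch.randn(2, 5, 32)
+    adapter = ZeroInitAdapter(32)
+    assert torch.allclose(x + adapter(x), x)  # the adapter branch starts as a no-op
+    ```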
+ """ + + config_class = MplugOwlConfig + base_model_prefix = "mplug_owl" + supports_gradient_checkpointing = True + _keys_to_ignore_on_load_missing = [ + r"position_ids", + r"language_model.encoder.embed_tokens.weight", + r"language_model.decoder.embed_tokens.weight", + r"language_model.lm_head.weight", + ] + _no_split_modules = [ + "MplugOwlVisionEncoderLayer", + "LlamaDecoderLayer", + "MplugOwlVisualAbstractorLayer", + "LlamaForCausalLM", + "Parameter", + ] + _keep_in_fp32_modules = ["wo"] + + def _init_weights(self, module): + """Initialize the weights""" + factor = self.config.initializer_range + if isinstance(module, nn.Conv2d) or isinstance(module, nn.Embedding) or isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=factor) + if hasattr(module, "bias") and module.bias is not None: + module.bias.data.zero_() + + if isinstance(module, MplugOwlVisionEmbeddings): + if hasattr(self.config, "vision_config"): + factor = self.config.vision_config.initializer_range + nn.init.trunc_normal_(module.position_embedding, mean=0.0, std=factor) + nn.init.trunc_normal_(module.cls_token, mean=0.0, std=factor) + + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + elif isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Parameter): + raise ValueError + nn.init.trunc_normal_(module.data, mean=0.0, std=factor) + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, MplugOwlVisionEncoder): + module.gradient_checkpointing = value + + +MPLUG_OWL_START_DOCSTRING = r""" + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`MplugOwlConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +MPLUG_OWL_VISION_INPUTS_DOCSTRING = r""" + Args: + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Pixel values can be obtained using [`MplugOwlProcessor`]. See [`MplugOwlProcessor.__call__`] for + details. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +MPLUG_OWL_TEXT_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. 
[What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + [What are attention masks?](../glossary#attention-mask) + decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Indices of decoder input sequence tokens in the vocabulary. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are decoder input IDs?](../glossary#decoder-input-ids) + + T5 uses the `pad_token_id` as the starting token for `decoder_input_ids` generation. If `past_key_values` + is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`). + + To know more on how to prepare `decoder_input_ids` for pretraining take a look at [T5 + Training](./t5#training). + decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also + be used by default. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +MPLUG_OWL_INPUTS_DOCSTRING = r""" + Args: + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Pixel values can be obtained using [`MplugOwlProcessor`]. See [`MplugOwlProcessor.__call__`] for + details. + + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of input sequence tokens in the vocabulary of the language model. Input tokens can optionally be + provided to serve as text prompt, which the language model can continue. + + Indices can be obtained using [`MplugOwlProcessor`]. See [`MplugOwlProcessor.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + + decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Indices of decoder input sequence tokens in the vocabulary of the language model. Only relevant in case an + encoder-decoder language model (like T5) is used. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. [What are decoder input IDs?](../glossary#decoder-input-ids) + + decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also + be used by default. 
+ + Only relevant in case an encoder-decoder language model (like T5) is used. + + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +class MplugOwlVisionEncoder(nn.Module): + """ + Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a + [`MplugOwlVisionEncoderLayer`]. + + Args: + config (`MplugOwlVisionConfig`): + The corresponding vision configuration for the `MplugOwlEncoder`. + """ + + def __init__(self, config: MplugOwlVisionConfig): + super().__init__() + self.config = config + self.layers = nn.ModuleList([MplugOwlVisionEncoderLayer(config) for _ in range(config.num_hidden_layers)]) + self.gradient_checkpointing = False + + def forward( + self, + inputs_embeds, + attention_mask: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutput]: + r""" + Args: + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Embedded representation of the inputs. Should be float, not int tokens. + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
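+        Example (editor's illustration, not part of the upstream docstring; it assumes this repository's
+        `lmms_eval.models.mplug_owl_video` package plus `torch`, `einops` and `transformers` are importable):
+
+        ```python
+        >>> import torch
+        >>> from lmms_eval.models.mplug_owl_video.configuration_mplug_owl import MplugOwlVisionConfig
+        >>> from lmms_eval.models.mplug_owl_video.modeling_mplug_owl import MplugOwlVisionEncoder
+
+        >>> # Tiny config: 28px frames with 14px patches -> (28 // 14) ** 2 = 4 patch tokens + 1 CLS token.
+        >>> cfg = MplugOwlVisionConfig(hidden_size=32, intermediate_size=64, num_hidden_layers=2,
+        ...                            num_attention_heads=4, image_size=28, patch_size=14)
+        >>> encoder = MplugOwlVisionEncoder(cfg)
+
+        >>> # (batch, time, tokens, hidden) -- two video frames of five tokens each.
+        >>> inputs_embeds = torch.randn(1, 2, 5, cfg.hidden_size)
+        >>> out = encoder(inputs_embeds=inputs_embeds, return_dict=True)
+        >>> out.last_hidden_state.shape
+        torch.Size([1, 2, 5, 32])
+        ```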
+ """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + encoder_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + + hidden_states = inputs_embeds + for idx, encoder_layer in enumerate(self.layers): + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(encoder_layer), + hidden_states, + attention_mask, + ) + else: + layer_outputs = encoder_layer( + hidden_states, + attention_mask, + output_attentions=output_attentions, + ) + + hidden_states = layer_outputs[0] + + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None) + return BaseModelOutput(last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions) + + +class MplugOwlVisionModel(MplugOwlPreTrainedModel): + main_input_name = "pixel_values" + config_class = MplugOwlVisionConfig + + def __init__(self, config: MplugOwlVisionConfig): + super().__init__(config) + self.config = config + self.hidden_size = config.hidden_size + + self.embeddings = MplugOwlVisionEmbeddings(config) + self.encoder = MplugOwlVisionEncoder(config) + self.post_layernorm = LayerNormFp32(self.hidden_size, eps=config.layer_norm_eps) + + self.post_init() + + @add_start_docstrings_to_model_forward(MPLUG_OWL_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=MplugOwlVisionConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPooling]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + hidden_states = self.embeddings(pixel_values) # [B, T, N, D] + + encoder_outputs = self.encoder( + inputs_embeds=hidden_states, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_state = encoder_outputs[0] + last_hidden_state = self.post_layernorm(last_hidden_state) + + pooled_output = last_hidden_state[:, :, 0, :].mean(1) + pooled_output = self.post_layernorm(pooled_output) + + if not return_dict: + return (last_hidden_state, pooled_output) + encoder_outputs[1:] + + return BaseModelOutputWithPooling( + last_hidden_state=last_hidden_state, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + 
attentions=encoder_outputs.attentions, + ) + + def get_input_embeddings(self): + return self.embeddings + + +class MplugOwlVisualAbstractorMLP(nn.Module): + def __init__(self, config: MplugOwlVisualAbstractorConfig): + super().__init__() + self.config = config + in_features = config.hidden_size + hidden_features = config.intermediate_size + if hidden_features != 2816: + hidden_features = int(2 * hidden_features / 3) + multiple_of = 256 + hidden_features = multiple_of * ((hidden_features + multiple_of - 1) // multiple_of) + self.act = nn.SiLU() + + self.w1 = nn.Linear(in_features, hidden_features) + self.w2 = nn.Linear(hidden_features, in_features) + self.w3 = nn.Linear(in_features, hidden_features) + self.ffn_ln = LayerNormFp32(hidden_features, eps=config.layer_norm_eps) + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.act(self.w1(hidden_states)) * self.w3(hidden_states) + hidden_states = self.ffn_ln(hidden_states) + hidden_states = self.w2(hidden_states) + return hidden_states + + +class MplugOwlVisualAbstractorMultiHeadAttention(nn.Module): + def __init__(self, config: MplugOwlVisualAbstractorConfig): + super().__init__() + self.config = config + if config.hidden_size % config.num_attention_heads != 0: + raise ValueError("The hidden size (%d) is not a multiple of the number of attention heads (%d)" % (config.hidden_size, config.num_attention_heads)) + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.encoder_hidden_size, self.all_head_size) + self.value = nn.Linear(config.encoder_hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + self.save_attention = False + + def save_attn_gradients(self, attn_gradients): + self.attn_gradients = attn_gradients + + def get_attn_gradients(self): + return self.attn_gradients + + def save_attention_map(self, attention_map): + self.attention_map = attention_map + + def get_attention_map(self): + return self.attention_map + + def transpose_for_scores(self, x): + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(*new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_value=None, + output_attentions=False, + ): + # If this is instantiated as a cross-attention module, the keys + # and values come from an encoder; the attention mask needs to be + # such that the encoder's padding tokens are not attended to. + key_layer = self.transpose_for_scores(self.key(encoder_hidden_states)) + value_layer = self.transpose_for_scores(self.value(encoder_hidden_states)) + attention_mask = encoder_attention_mask + + mixed_query_layer = self.query(hidden_states) + + query_layer = self.transpose_for_scores(mixed_query_layer) + + past_key_value = (key_layer, value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. 
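+        # (Editor's note, not in the upstream file: `query_layer` holds the learned query tokens with shape
+        # (batch, num_heads, query_len, head_dim), while `key_layer`/`value_layer` are projected from
+        # `encoder_hidden_states`, so the scores computed below have shape (batch, num_heads, query_len, key_len).)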
+ attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + + if attention_mask is not None: + # Apply the attention mask is (precomputed for all layers in BertModel forward() function) + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = nn.Softmax(dim=-1)(attention_scores) + + if self.save_attention: + self.save_attention_map(attention_probs) + attention_probs.register_hook(self.save_attn_gradients) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs_dropped = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs_dropped = attention_probs_dropped * head_mask + + context_layer = torch.matmul(attention_probs_dropped, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + outputs = outputs + (past_key_value,) + return outputs + + +class MplugOwlVisualAbstractorCrossOutput(nn.Module): + def __init__(self, config: MplugOwlVisualAbstractorConfig): + super().__init__() + dim = config.hidden_size + self.out_proj = nn.Linear(dim, dim, bias=True) + self.norm2 = LayerNormFp32(dim) + self.mlp = MplugOwlVisualAbstractorMLP(config) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + input_tensor = input_tensor + self.out_proj(hidden_states) + input_tensor = input_tensor + self.mlp(self.norm2(input_tensor)) + return input_tensor + + +class MplugOwlVisualAbstractorAttention(nn.Module): + def __init__(self, config: MplugOwlVisualAbstractorConfig): + super().__init__() + self.attention = MplugOwlVisualAbstractorMultiHeadAttention(config) + self.output = MplugOwlVisualAbstractorCrossOutput(config) + self.pruned_heads = set() + self.norm1 = LayerNormFp32(config.hidden_size) + self.normk = LayerNormFp32(config.hidden_size) + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices(heads, self.attention.num_attention_heads, self.attention.attention_head_size, self.pruned_heads) + + # Prune linear layers + self.attention.query = prune_linear_layer(self.attention.query, index) + self.attention.key = prune_linear_layer(self.attention.key, index) + self.attention.value = prune_linear_layer(self.attention.value, index) + self.output.dense = prune_linear_layer(self.output.out_proj, index, dim=1) + + # Update hyper params and store pruned heads + self.attention.num_attention_heads = self.attention.num_attention_heads - len(heads) + self.attention.all_head_size = self.attention.attention_head_size * self.attention.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + # HACK we apply norm on 
q and k + hidden_states = self.norm1(hidden_states) + encoder_hidden_states = self.normk(encoder_hidden_states) + encoder_hidden_states = torch.cat([hidden_states, encoder_hidden_states], dim=1) + encoder_attention_mask = torch.cat([attention_mask, encoder_attention_mask], dim=-1) + self_outputs = self.attention( + hidden_states, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + # add attentions if we output them + outputs = (attention_output,) + self_outputs[1:] + return outputs + + +class MplugOwlVisualAbstractorLayer(nn.Module): + def __init__(self, config, layer_idx): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + + self.layer_idx = layer_idx + + self.crossattention = MplugOwlVisualAbstractorAttention(config) + self.has_cross_attention = True + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=False, + ): + if encoder_hidden_states is None: + raise ValueError("encoder_hidden_states must be given for cross-attention layers") + cross_attention_outputs = self.crossattention( + hidden_states, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions=output_attentions, + ) + query_attention_output = cross_attention_outputs[0] + + outputs = (query_attention_output,) + return outputs + + +class MplugOwlVisualAbstractorEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layers = nn.ModuleList([MplugOwlVisualAbstractorLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]) + self.gradient_checkpointing = False + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_values=None, + output_attentions=False, + output_hidden_states=False, + return_dict=True, + ): + all_hidden_states = () if output_hidden_states else None + + for i in range(self.config.num_hidden_layers): + layer_module = self.layers[i] + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + past_key_value = past_key_values[i] if past_key_values is not None else None + + if getattr(self.config, "gradient_checkpointing", False) and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, past_key_value, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(layer_module), + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions, + ) + + hidden_states = layer_outputs[0] + + return BaseModelOutput( + last_hidden_state=hidden_states, + ) + + +class MplugOwlVisualAbstractorModel(MplugOwlPreTrainedModel): + def __init__(self, config: MplugOwlVisualAbstractorConfig, language_hidden_size): + super().__init__(config) + self.config = config + + self.encoder = MplugOwlVisualAbstractorEncoder(config) + self.visual_fc = torch.nn.Linear(config.hidden_size, language_hidden_size) + 
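+        # (Editor's note, not in the upstream file: `visual_fc` and the `temporal_visual_fc` defined on the
+        # next line project the abstractor's hidden states into the language model's embedding space;
+        # `forward` later appends the learned `vit_eos` vector, so each image/video contributes
+        # num_query_tokens + 1 language-space embeddings.)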
self.temporal_visual_fc = torch.nn.Linear(config.hidden_size, language_hidden_size) + self.vit_eos = torch.nn.Parameter(torch.randn(1, 1, language_hidden_size)) + nn.init.trunc_normal_(self.vit_eos, mean=0.0, std=self.config.initializer_range) + self.post_init() + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + def get_extended_attention_mask( + self, + attention_mask: torch.Tensor, + input_shape: Tuple[int], + device: torch.device, + ) -> torch.Tensor: + """ + Makes broadcastable attention and causal masks so that future and masked tokens are ignored. + + Arguments: + attention_mask (`torch.Tensor`): + Mask with ones indicating tokens to attend to, zeros for tokens to ignore. + input_shape (`Tuple[int]`): + The shape of the input to the model. + device: (`torch.device`): + The device of the input to the model. + + Returns: + `torch.Tensor` The extended attention mask, with a the same dtype as `attention_mask.dtype`. + """ + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + if attention_mask.dim() == 3: + extended_attention_mask = attention_mask[:, None, :, :] + elif attention_mask.dim() == 2: + # Provided a padding mask of dimensions [batch_size, seq_length] + # - the model is an encoder, so make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length] + extended_attention_mask = attention_mask[:, None, None, :] + else: + raise ValueError("Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(input_shape, attention_mask.shape)) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and -10000.0 for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + extended_attention_mask = extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + return extended_attention_mask + + def forward( + self, + query_embeds, + temporal_query_embeds=None, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_values=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, `optional`): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if + the model is configured as a decoder. + encoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, `optional`): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. 
+ past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of: + shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): Contains precomputed key and + value hidden states of the attention blocks. Can be used to speed up decoding. If `past_key_values` are + used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key + value states given to this model) of shape `(batch_size, 1)` instead of all `decoder_input_ids` of shape + `(batch_size, sequence_length)`. + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + T = encoder_hidden_states.size(1) + if T == 1 or temporal_query_embeds is None: + embedding_output = query_embeds + else: + embedding_output = torch.cat([query_embeds, temporal_query_embeds], dim=1) + input_shape = embedding_output.size()[:-1] + batch_size, seq_length = input_shape + device = embedding_output.device + + encoder_hidden_states = einops.rearrange(encoder_hidden_states, "b t n d -> b (t n) d") + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + if attention_mask is None: + attention_mask = torch.ones((embedding_output.shape[0], embedding_output.shape[1]), dtype=torch.long, device=embedding_output.device) + extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, device) + + # If a 2D or 3D attention mask is provided for the cross-attention + # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] + if encoder_hidden_states is not None: + if type(encoder_hidden_states) == list: + encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states[0].size() + else: + ( + encoder_batch_size, + encoder_sequence_length, + _, + ) = encoder_hidden_states.size() + encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length) + + if type(encoder_attention_mask) == list: + encoder_extended_attention_mask = [self.invert_attention_mask(mask) for mask in encoder_attention_mask] + elif encoder_attention_mask is None: + encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device) + encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask) + else: + encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask) + else: + encoder_extended_attention_mask = None + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) + + encoder_outputs = self.encoder( + embedding_output, + attention_mask=extended_attention_mask, + head_mask=head_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_extended_attention_mask, + past_key_values=past_key_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = 
encoder_outputs[0] + pooled_output = sequence_output[:, 0, :] + + if T == 1 or temporal_query_embeds is None: + temporal_sequence_output = None + else: + temporal_sequence_output = sequence_output[:, query_embeds.size(1) :] + sequence_output = sequence_output[:, : query_embeds.size(1)] + + sequence_output = self.visual_fc(sequence_output) + if temporal_sequence_output is not None: + sequence_output += self.temporal_visual_fc(temporal_sequence_output) + sequence_output = torch.cat([sequence_output, self.vit_eos.repeat(sequence_output.shape[0], 1, 1)], dim=1) + + return BaseModelOutputWithPooling( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + ) + + +@add_start_docstrings( + """ + mPLUG-Owl Model for generating text and image features. The model consists of a vision encoder, Querying Transformer + (Q-Former) and a language model. + """, + MPLUG_OWL_START_DOCSTRING, +) +class MplugOwlModel(MplugOwlPreTrainedModel): + config_class = MplugOwlConfig + main_input_name = "pixel_values" + + def __init__(self, config: MplugOwlConfig, *inputs, **kwargs): + super().__init__(config, *inputs, **kwargs) + + self.vision_model = MplugOwlVisionModel(config.vision_config) + + self.query_tokens = nn.Parameter(torch.zeros(1, config.num_query_tokens, config.visual_abstractor_config.hidden_size)) + self.temporal_query_tokens = nn.Parameter(torch.zeros(1, config.num_query_tokens, config.visual_abstractor_config.hidden_size)) + self.abstractor = MplugOwlVisualAbstractorModel(config.visual_abstractor_config, config.text_config.hidden_size) + + # if config.use_decoder_only_language_model: + # from llama.modeling_llama import LlamaForCausalLM + language_model = AutoModelForCausalLM.from_config(config.text_config) + # else: + # language_model = AutoModelForSeq2SeqLM.from_config(config.text_config) + self.language_model = language_model + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.language_model.get_input_embeddings() + + def set_input_embeddings(self, value): + self.language_model.set_input_embeddings(value) + + def set_output_embeddings(self, new_embeddings): + self.language_model.set_output_embeddings(new_embeddings) + + def get_output_embeddings(self) -> nn.Module: + return self.language_model.get_output_embeddings() + + def get_encoder(self): + return self.language_model.get_encoder() + + def get_decoder(self): + return self.language_model.get_decoder() + + def _tie_weights(self): + if not self.config.use_decoder_only_language_model: + self.language_model.encoder.embed_tokens = self.language_model.shared + self.language_model.decoder.embed_tokens = self.language_model.shared + + def get_text_features( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + decoder_input_ids: Optional[torch.Tensor] = None, + decoder_attention_mask: Optional[torch.Tensor] = None, + labels: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ): + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if self.config.use_decoder_only_language_model: + text_outputs = 
self.language_model( + input_ids=input_ids, + attention_mask=attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + else: + inputs_embeds = self.language_model.get_input_embeddings()(input_ids) + + text_outputs = self.language_model( + inputs_embeds=inputs_embeds, + attention_mask=attention_mask, + decoder_input_ids=decoder_input_ids, + decoder_attention_mask=decoder_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + labels=labels, + ) + + return text_outputs + + def get_image_features( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ): + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + return vision_outputs + + +def get_media_indices(my_list): + if isinstance(my_list, torch.Tensor): + my_list = my_list.cpu().tolist() + result = [] + for i in range(len(my_list)): + if i == 0 and my_list[i] < 0: + result.append(i) + elif my_list[i] != my_list[i - 1] and my_list[i] < 0: + result.append(i) + return result + + +def get_media_types(tensors, positions): + if isinstance(tensors, torch.Tensor): + tensors = tensors.cpu().tolist() + result = [] + for pos in positions: + result.append(tensors[pos]) + return result + + +@add_start_docstrings( + """ + mPLUG-Owl Model for generating text given an image and an optional text prompt. 
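+    (Editor's illustration of the `get_media_indices` / `get_media_types` helpers defined above; not part of
+    the upstream docstring. It assumes the processor marks image and video placeholders with runs of distinct
+    negative token ids — the values -1 and -2 below are chosen purely for illustration.)
+
+    ```python
+    >>> from lmms_eval.models.mplug_owl_video.modeling_mplug_owl import get_media_indices, get_media_types
+
+    >>> input_ids = [1, -1, -1, -1, 319, 13563, -2, -2, -2, 2]
+    >>> positions = get_media_indices(input_ids)   # start index of each placeholder run
+    >>> positions
+    [1, 6]
+    >>> get_media_types(input_ids, positions)      # the marker value identifies the media type
+    [-1, -2]
+    ```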
+ """, + MPLUG_OWL_START_DOCSTRING, +) +class MplugOwlForConditionalGeneration(MplugOwlPreTrainedModel): + config_class = MplugOwlConfig + main_input_name = "pixel_values" + + def __init__(self, config: MplugOwlConfig): + super().__init__(config) + + self.vision_model = MplugOwlVisionModel(config.vision_config) + + self.query_tokens = nn.Parameter(torch.zeros(1, config.num_query_tokens, config.visual_abstractor_config.hidden_size)) + self.temporal_query_tokens = nn.Parameter(torch.zeros(1, config.num_query_tokens, config.visual_abstractor_config.hidden_size)) + self.abstractor = MplugOwlVisualAbstractorModel(config.visual_abstractor_config, config.text_config.hidden_size) + + # if config.use_decoder_only_language_model: + # from llama.modeling_llama import LlamaForCausalLM + language_model = AutoModelForCausalLM.from_config(config.text_config) + # else: + # language_model = AutoModelForSeq2SeqLM.from_config(config.text_config) + self.language_model = language_model + + # Initialize weights and apply final processing + self.post_init() + self.main_input_name = "input_ids" + from transformers import GenerationConfig + + self.generation_config = GenerationConfig(max_length=512, do_sample=True, top_k=3, pad_token_id=0, unk_token_id=0, bos_token_id=1, eos_token_id=2) + + # Hack Bloom + if config.text_config.model_type == "bloom": + bound_method = bloom_forward.__get__(self.language_model.transformer, self.language_model.transformer.__class__) + setattr(self.language_model.transformer, "forward", bound_method) + + def get_input_embeddings(self): + return self.language_model.get_input_embeddings() + + def set_input_embeddings(self, value): + self.language_model.set_input_embeddings(value) + + def set_output_embeddings(self, new_embeddings): + self.language_model.set_output_embeddings(new_embeddings) + + def get_output_embeddings(self) -> nn.Module: + return self.language_model.get_output_embeddings() + + def get_encoder(self): + return self.language_model.get_encoder() + + def get_decoder(self): + return self.language_model.get_decoder() + + def _tie_weights(self): + if not self.config.use_decoder_only_language_model: + self.language_model.encoder.embed_tokens = self.language_model.shared + self.language_model.decoder.embed_tokens = self.language_model.shared + + def _preprocess_accelerate(self): + r""" + Some pre-processing hacks to make the model `accelerate` compatible. Check + https://github.com/huggingface/transformers/pull/21707 for more details. + """ + hf_device_map = self.hf_device_map + + if len(hf_device_map) > 1 and "language_model" not in hf_device_map and torch.cuda.device_count() > 1: + # warn users about unexpected behavior when using multi-GPU + mPLUG-Owl + `accelerate`. + logger.warning( + "The `language_model` is not in the `hf_device_map` dictionary and you are running your script" + " in a multi-GPU environment. this may lead to unexpected behavior when using `accelerate`." + " Please pass a `device_map` that contains `language_model` to remove this warning." 
+ " Please refer to https://github.com/huggingface/blog/blob/main/accelerate-large-models.md for" + " more details on creating a `device_map` for large models.", + ) + + if hasattr(self.language_model, "_hf_hook"): + self.language_model._hf_hook.io_same_device = True # For `generate` compatibility + + @add_start_docstrings_to_model_forward(MPLUG_OWL_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=MplugOwlForConditionalGenerationModelOutput, config_class=MplugOwlVisionConfig) + def forward( + self, + pixel_values: torch.FloatTensor, + video_pixel_values: torch.FloatTensor, + input_ids: torch.FloatTensor, + num_images, + num_videos, + non_padding_mask: Optional[torch.LongTensor] = None, + non_media_mask: Optional[torch.LongTensor] = None, + prompt_mask: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + decoder_input_ids: Optional[torch.LongTensor] = None, + decoder_attention_mask: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + labels: Optional[torch.LongTensor] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, MplugOwlForConditionalGenerationModelOutput]: + r""" + Returns: + + Examples: + + Image captioning (without providing a text prompt): + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import MplugOwlProcessor, MplugOwlForConditionalGeneration + >>> import torch + + >>> device = "cuda" if torch.cuda.is_available() else "cpu" + + >>> processor = MplugOwlProcessor.from_pretrained("x-plug/x_plug-llama-7b") + >>> model = MplugOwlForConditionalGeneration.from_pretrained( + ... "x-plug/x_plug-llama-7b", torch_dtype=torch.float16 + ... ) + >>> model.to(device) # doctest: +IGNORE_RESULT + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt").to(device, torch.float16) + + >>> generated_ids = model.generate(**inputs) + >>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() + >>> print(generated_text) + two cats laying on a couch + ``` + + Visual question answering (prompt = question): + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import MplugOwlProcessor, MplugOwlForConditionalGeneration + >>> import torch + + >>> device = "cuda" if torch.cuda.is_available() else "cpu" + + >>> processor = MplugOwlProcessor.from_pretrained("x-plug/x_plug-llama-7b") + >>> model = MplugOwlForConditionalGeneration.from_pretrained( + ... "x-plug/x_plug-llama-7b", torch_dtype=torch.float16 + ... ) + >>> model.to(device) # doctest: +IGNORE_RESULT + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> prompt = "Question: how many cats are there? 
Answer:" + >>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16) + + >>> generated_ids = model.generate(**inputs) + >>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() + >>> print(generated_text) + two + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # get text embedding + text_tokens_ = input_ids.clone() + batch_size = input_ids.shape[0] + # labels = text_tokens_[:, 1:].clone().contiguous() + + media_token_indices = [ + # [:-1] since we would not use the last token for embedding + get_media_indices(text_tokens_[i][:-1]) + for i in range(batch_size) + ] + + media_token_types = [get_media_types(text_tokens_[i][:-1], media_token_indices[i]) for i in range(batch_size)] + + text_tokens_[text_tokens_ < 0] = 1 # Not used + # text_tokens = text_tokens_[:, :-1].contiguous() + text_embeds = self.get_input_embeddings()(text_tokens_) # Temporally Embedding + + if pixel_values is not None: + image_embeds = self.vision_model(pixel_values, return_dict=True).last_hidden_state + + image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device) + query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1) + temporal_query_tokens = self.temporal_query_tokens.expand(image_embeds.shape[0], -1, -1) + + query_features = self.abstractor( + query_embeds=query_tokens, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_attention_mask, + )["last_hidden_state"] + img_seq_length = query_features.shape[1] + + if video_pixel_values is not None: + video_embeds = self.vision_model(video_pixel_values, return_dict=True).last_hidden_state + + video_attention_mask = torch.ones(video_embeds.size()[:-1], dtype=torch.long, device=video_embeds.device) + video_attention_mask = einops.rearrange(video_attention_mask, "b t n -> b (t n)") + query_tokens = self.query_tokens.expand(video_embeds.shape[0], -1, -1) + temporal_query_tokens = self.temporal_query_tokens.expand(video_embeds.shape[0], -1, -1) + + video_query_features = self.abstractor( + query_embeds=query_tokens, + temporal_query_embeds=temporal_query_tokens, + encoder_hidden_states=video_embeds, + encoder_attention_mask=video_attention_mask, + )["last_hidden_state"] + vid_seq_length = video_query_features.shape[1] + + num_images_per_sample = num_images.long().cpu().tolist() + num_videos_per_sample = num_videos.long().cpu().tolist() + + text_chunk_embeds = [] + img_idx = 0 + for b in range(batch_size): + start = 0 + result = [] + if len(media_token_indices[b]) > 0: + for i, pos in enumerate(media_token_indices[b]): + if pos > start: + result.append(text_embeds[b, start:pos]) + result.append(query_features[img_idx + i]) + start = pos + img_seq_length + if start < text_embeds.shape[1]: + result.append(text_embeds[b, start:]) + + img_idx += num_images_per_sample[b] + text_chunk_embeds.append(torch.cat(result, dim=0)) + + # Actual Input Embeddings + input_embeds = torch.stack(text_chunk_embeds, dim=0) + + # if pixel_values is None and self.language_model.is_gradient_checkpointing: + # # Hack here when gradient checkpoint is enable. 
+ # # Keep the compute graph static + # image_embeds = self.vision_model(torch.zeros(1,3,224,224,device=input_embeds.device,dtype=input_embeds.dtype), return_dict=True).last_hidden_state + # query_tokens = self.query_tokens.expand( + # image_embeds.shape[0], -1, -1) + # query_features = self.abstractor(query_embeds=query_tokens, + # encoder_hidden_states=image_embeds,)['last_hidden_state'] + + # input_embeds = input_embeds + query_features.mean()*0 + + # Create causal mask and position ids + _, loss_mask, position_ids = get_ltor_masks_and_position_ids_from_embeddings(input_embeds) + + # Calculate the loss_mask + non_padding_mask = non_padding_mask.long() + non_media_mask = non_media_mask.long() + prompt_mask = prompt_mask.long() # TODO How to deal with prompt mask + # from icecream import ic + # non_padding_mask = non_padding_mask[:,:-1] + # non_media_mask = non_media_mask[:,:-1] + # prompt_mask = prompt_mask[:,:-1] + # attention_mask = attention_mask[:,:-1] + loss_mask = loss_mask[:, :-1] + + loss_mask = loss_mask * non_padding_mask * non_media_mask * prompt_mask + labels[:, 1:][loss_mask != 1] = -100 + # Forward into GPT + outputs = self.language_model( + inputs_embeds=input_embeds, + attention_mask=attention_mask, + labels=labels, + return_dict=return_dict, + output_attentions=self.config.output_attentions, + ) + # outputs.loss = (outputs.loss * loss_mask.view(-1) + # ).sum()/loss_mask.sum() + return outputs + + @torch.no_grad() + def generate( + self, + pixel_values: torch.FloatTensor = None, + video_pixel_values: torch.FloatTensor = None, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + isdecoder=True, + **generate_kwargs, + ) -> torch.LongTensor: + """ + Overrides `generate` function to be able to use the model as a conditional generator. + + Args: + pixel_values (`torch.FloatTensor` of shape (batch_size, num_channels, height, width)): + Input images to be processed. + input_ids (`torch.LongTensor` of shape (batch_size, sequence_length), *optional*): + The sequence used as a prompt for the generation. + attention_mask (`torch.LongTensor` of shape (batch_size, sequence_length), *optional*): + Mask to avoid performing attention on padding token indices + + Returns: + captions (list): A list of strings of length batch_size * num_captions. 
+ """ + if input_ids is None: + return self.language_model.generate(attention_mask=attention_mask, **generate_kwargs) + + if attention_mask is None: + attention_mask = input_ids.new_ones(*input_ids.shape) + + batch_size = input_ids.size(0) + media_token_indices = [get_media_indices(input_ids[i]) for i in range(batch_size)] + media_token_types = [get_media_types(input_ids[i], media_token_indices[i]) for i in range(batch_size)] + num_images_per_sample = [len([y for y in x if y == -1]) for x in media_token_types] + num_videos_per_sample = [len([y for y in x if y < -1]) for x in media_token_types] + input_ids = input_ids.clone() # prevent inplace modify + input_ids[input_ids < 0] = 0 # Not used + + if hasattr(self, "hf_device_map"): + # preprocess for `accelerate` + self._preprocess_accelerate() + batch_size = input_ids.shape[0] + # get text embedding + inputs_embeds = self.get_input_embeddings()(input_ids) + if hasattr(self.language_model, "transformer") and hasattr(self.language_model.transformer, "word_embeddings_layernorm"): + inputs_embeds = self.language_model.transformer.word_embeddings_layernorm(inputs_embeds) + # get visual embedding + if pixel_values is not None: + pixel_values = pixel_values.to(input_ids.device) + with torch.no_grad(): + image_embeds = self.vision_model(pixel_values, return_dict=True).last_hidden_state + image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device) + query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1) + query_outputs = self.abstractor( + query_embeds=query_tokens, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_attention_mask, + return_dict=True, + ) + query_output = query_outputs["last_hidden_state"] + image_embeds = query_output + img_seq_length = image_embeds.shape[1] + + if video_pixel_values is not None: + video_pixel_values = video_pixel_values.to(input_ids.device) + with torch.no_grad(): + video_embeds = self.vision_model(video_pixel_values, return_dict=True).last_hidden_state + video_attention_mask = torch.ones(video_embeds.size()[:-1], dtype=torch.long, device=video_embeds.device) + video_attention_mask = einops.rearrange(video_attention_mask, "b t n -> b (t n)") + query_tokens = self.query_tokens.expand(video_embeds.shape[0], -1, -1) + temporal_query_tokens = self.temporal_query_tokens.expand(video_embeds.shape[0], -1, -1) + query_outputs = self.abstractor( + query_embeds=query_tokens, + temporal_query_embeds=temporal_query_tokens, + encoder_hidden_states=video_embeds, + encoder_attention_mask=video_attention_mask, + return_dict=True, + ) + query_output = query_outputs["last_hidden_state"] + video_embeds = query_output + vid_seq_length = video_embeds.shape[1] + + # =================== + # Get actual input embeddings + # =================== + text_chunk_embeds = [] + text_chunk_attns = [] + img_idx = 0 + vid_idx = 0 + + for b in range(batch_size): + start = 0 + result = [] + result_attn = [] + for i, pos in enumerate(media_token_indices[b]): + curr_image_idx, curr_video_idx = 0, 0 + if pos > start: + result.append(inputs_embeds[b, start:pos]) + result_attn.append(attention_mask[b, start:pos]) + if media_token_types[b][i] == -1: + result.append(image_embeds[img_idx + curr_image_idx]) + result_attn.append(torch.ones(image_embeds[img_idx + curr_image_idx].shape[0], device=inputs_embeds.device)) + start = pos + img_seq_length + curr_image_idx += 1 + else: + result.append(video_embeds[vid_idx + curr_video_idx]) + result_attn.append(torch.ones(video_embeds[img_idx 
+ curr_video_idx].shape[0], device=inputs_embeds.device)) + start = pos + vid_seq_length + curr_video_idx += 1 + if start < inputs_embeds.shape[1]: + result.append(inputs_embeds[b, start:]) + result_attn.append(attention_mask[b, start:]) + + img_idx += num_images_per_sample[b] + vid_idx += num_videos_per_sample[b] + text_chunk_embeds.append(torch.cat(result, dim=0)) + text_chunk_attns.append(torch.cat(result_attn, dim=0)) + inputs_embeds = torch.stack(text_chunk_embeds, dim=0) + attention_mask = torch.stack(text_chunk_attns, dim=0) + + outputs = self.language_model.generate( + inputs_embeds=inputs_embeds, + # input_ids=input_ids, + attention_mask=attention_mask, + **generate_kwargs, + ) + + return outputs + + def prepare_inputs_for_generation(self, input_ids, pixel_values=None, video_pixel_values=None, past_key_values=None, attention_mask=None, **model_kwargs): + input_shape = input_ids.shape + # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly + if attention_mask is None: + attention_mask = input_ids.new_ones(input_shape) + + # # cut decoder_input_ids if past_key_values is used + # if past_key_values is not None: + # input_ids = input_ids[:, -1:] + + return { + "input_ids": input_ids, + "pixel_values": pixel_values, + "video_pixel_values": video_pixel_values, + "attention_mask": attention_mask, + # "past_key_values": past_key_values, + # "encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None), + # "encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None), + "is_decoder": True, + } + + +def bloom_forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + **deprecated_arguments, +) -> Union[Tuple[torch.Tensor, ...], BaseModelOutputWithPastAndCrossAttentions]: + if deprecated_arguments.pop("position_ids", False) is not False: + # `position_ids` could have been `torch.Tensor` or `None` so defaulting pop to `False` allows to detect if users were passing explicitly `None` + warnings.warn( + "`position_ids` have no functionality in BLOOM and will be removed in v5.0.0. 
You can safely ignore" " passing `position_ids`.", + FutureWarning, + ) + if len(deprecated_arguments) > 0: + raise ValueError(f"Got unexpected arguments: {deprecated_arguments}") + + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + batch_size, seq_length = input_ids.shape + elif inputs_embeds is not None: + batch_size, seq_length, _ = inputs_embeds.shape + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + if past_key_values is None: + past_key_values = tuple([None] * len(self.h)) + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape batch_size x num_heads x N x N + # head_mask has shape n_layer x batch x num_heads x N x N + head_mask = self.get_head_mask(head_mask, self.config.n_layer) + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + inputs_embeds = self.word_embeddings_layernorm(inputs_embeds) + + hidden_states = inputs_embeds + + presents = () if use_cache else None + all_self_attentions = () if output_attentions else None + all_hidden_states = () if output_hidden_states else None + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once("`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...") + use_cache = False + + # Compute alibi tensor: check build_alibi_tensor documentation + seq_length_with_past = seq_length + past_key_values_length = 0 + if past_key_values[0] is not None: + past_key_values_length = past_key_values[0][0].shape[2] + seq_length_with_past = seq_length_with_past + past_key_values_length + if attention_mask is None: + attention_mask = torch.ones((batch_size, seq_length_with_past), device=hidden_states.device) + else: + attention_mask = attention_mask.to(hidden_states.device) + + alibi = self.build_alibi_tensor(attention_mask, self.num_heads, dtype=hidden_states.dtype) + + causal_mask = self._prepare_attn_mask( + attention_mask, + input_shape=(batch_size, seq_length), + past_key_values_length=past_key_values_length, + ) + + for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + # None for past_key_value + return module(*inputs, use_cache=use_cache, output_attentions=output_attentions) + + return custom_forward + + outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(block), + hidden_states, + alibi, + causal_mask, + layer_past, + head_mask[i], + ) + else: + outputs = block( + hidden_states, + layer_past=layer_past, + attention_mask=causal_mask, + head_mask=head_mask[i], + use_cache=use_cache, + output_attentions=output_attentions, + alibi=alibi, + ) + + hidden_states = outputs[0] + if use_cache is True: + presents = presents + (outputs[1],) + + if output_attentions: + all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],) + + # Add last hidden state + hidden_states = self.ln_f(hidden_states) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None) + + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=presents, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) diff --git a/lmms_eval/models/mplug_owl_video/processing_mplug_owl.py b/lmms_eval/models/mplug_owl_video/processing_mplug_owl.py new file mode 100644 index 00000000..38cbf023 --- /dev/null +++ b/lmms_eval/models/mplug_owl_video/processing_mplug_owl.py @@ -0,0 +1,262 @@ +import re +import torch +import torch.utils.checkpoint + +from transformers.processing_utils import ProcessorMixin +from transformers.tokenization_utils_base import BatchEncoding +from transformers.models.clip.image_processing_clip import CLIPImageProcessor +from .tokenization_mplug_owl import MplugOwlTokenizer + +from decord import VideoReader +import numpy as np +from PIL import Image +from lmms_eval.models.model_utils.load_video import read_video_pyav + + +def get_index(num_frames, num_segments): + seg_size = float(num_frames - 1) / num_segments + start = int(seg_size / 2) + offsets = np.array([start + int(np.round(seg_size * idx)) for idx in range(num_segments)]) + return offsets + + +def load_video(path, num_frames=4): + """vr = VideoReader(path, height=224, width=224) + total_frames = len(vr) + frame_indices = get_index(total_frames, num_frames) + images_group = list() + for frame_index in frame_indices: + img = Image.fromarray(vr[frame_index].asnumpy()).convert("RGB") + images_group.append(img) + return 
images_group""" + # Change a bit here from the original code + # I use pyav instead of decord because it is much more safer + # The operations here are the same, we load video and return a list of PIL Image + # Load video frames + video_frames = read_video_pyav(path, num_frm=num_frames) + target_h, target_w = 224, 224 + # If image shape is not as target, resize it + if video_frames.shape[-3] != target_h or video_frames.shape[-2] != target_w: + video_frames = torch.from_numpy(video_frames).permute(0, 3, 1, 2).float() + video_frames = torch.nn.functional.interpolate(video_frames, size=(target_h, target_w)) + video_frames = video_frames.permute(0, 2, 3, 1).to(torch.uint8).numpy() + video_frames = [Image.fromarray(frame) for frame in video_frames] + if len(video_frames) > num_frames: + video_frames = video_frames[:num_frames] + return video_frames + + +class MplugOwlProcessor(ProcessorMixin): + attributes = [] + tokenizer_class = "MplugOwlTokenizer" + + def __init__(self, image_processor=None, tokenizer=None, **kwargs): + super().__init__(**kwargs) + self.tokens_to_generate = 0 + self.image_processor = image_processor + self.tokenizer = tokenizer + self.add_BOS = True + + def __call__(self, text=None, images=None, videos=None, num_frames=4, return_tensors=None, **kwargs): + if text is None and images is None: + raise ValueError("You have to specify either text or images. Both cannot be none.") + + if text is not None: + encoding = tokenize_prompts( + prompts=text, + tokens_to_generate=self.tokens_to_generate, + add_BOS=self.add_BOS, + tokenizer=self.tokenizer, + ignore_dist=True, + **kwargs, + ) + # encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs) + + if images is not None: + image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs) + + if videos is not None: + video_features = [] + for video in videos: + video_frames = load_video(video, num_frames) + video_feature = self.image_processor(video_frames, return_tensors=return_tensors, **kwargs)["pixel_values"] + video_features.append(video_feature) + video_features = torch.stack(video_features, dim=0) + video_features = video_features.permute(0, 2, 1, 3, 4) + + if text is not None and images is not None: + encoding["pixel_values"] = image_features.pixel_values + return encoding + if text is not None and videos is not None: + encoding["video_pixel_values"] = video_features + return encoding + elif text is not None: + return encoding + elif images is not None: + return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors) + else: + return BatchEncoding(data=dict(video_pixel_values=video_pixel_values), tensor_type=return_tensors) + + def batch_decode(self, skip_special_tokens=True, *args, **kwargs): + """ + This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please + refer to the docstring of this method for more information. + """ + return self.tokenizer.batch_decode(*args, skip_special_tokens=skip_special_tokens, **kwargs) + + def decode(self, skip_special_tokens=True, *args, **kwargs): + """ + This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to + the docstring of this method for more information. 
+ """ + return self.tokenizer.decode(*args, skip_special_tokens=skip_special_tokens, **kwargs) + + +class MplugOwlImageProcessor(CLIPImageProcessor): + pass + + +def detokenize_generations(tokens_gpu_tensor, lengths_gpu_tensor, return_segments, tokenizer): + """Detokenize the generated tokens.""" + + prompts_plus_generations = [] + if return_segments: + prompts_plus_generations_segments = [] + + tokens = tokens_gpu_tensor.cpu().numpy().tolist() + lengths = lengths_gpu_tensor.cpu().numpy().tolist() + for sequence_tokens, length in zip(tokens, lengths): + sequence_tokens = sequence_tokens[:length] + prompts_plus_generations.append(tokenizer.detokenize(sequence_tokens)) + if return_segments: + from tokenizers.decoders import Metaspace + + if hasattr(tokenizer, "tokenizer"): + if isinstance(tokenizer.tokenizer.decoder, Metaspace): + words = tokenizer.tokenizer.decode(sequence_tokens) + else: + words = [] + for token in sequence_tokens: + word = tokenizer.tokenizer.decoder[token] + word = bytearray([tokenizer.tokenizer.byte_decoder[c] for c in word]).decode("utf-8", errors="replace") + words.append(word) + prompts_plus_generations_segments.append(words) + else: + words = tokenizer.detokenize(sequence_tokens) + # else: + # words = [] + # for token in sequence_tokens: + # word = tokenizer.tokenizer.decoder[token] + # word = bytearray( + # [tokenizer.tokenizer.byte_decoder[c] for c in word]).decode( + # 'utf-8', errors='replace') + # words.append(word) + prompts_plus_generations_segments.append(words) + + if return_segments: + return tokens, prompts_plus_generations, prompts_plus_generations_segments + + return tokens, prompts_plus_generations + + +def tokenize_prompts(prompts=None, tokens_to_generate=None, add_BOS=None, rank=0, tokenizer=None, ignore_dist=False, **kwargs): + """Tokenize prompts and make them avaiable on all ranks.""" + + # On all ranks set to None so we can pass them to functions + prompts_tokens_cuda_long_tensor = None + prompts_length_cuda_long_tensor = None + + # On the specified rank, build the above. + attention_mask = None + if ignore_dist or torch.distributed.get_rank() == rank: + assert prompts is not None + assert tokens_to_generate is not None + # Tensor of tokens padded and their unpadded length. + prompts_tokens_cuda_long_tensor, prompts_length_cuda_long_tensor, attention_mask = _tokenize_prompts_and_batch(prompts, tokens_to_generate, add_BOS, tokenizer, **kwargs) + # We need the sizes of these tensors for the boradcast + [ + prompts_tokens_cuda_long_tensor.size(0), # Batch size + prompts_tokens_cuda_long_tensor.size(1), + ] # Sequence lenght + + return { + "input_ids": prompts_tokens_cuda_long_tensor, + "attention_mask": attention_mask, + # "prompt_length": prompts_length_cuda_long_tensor, + } + + +def _tokenize_prompts_and_batch(prompts, tokens_to_generate, add_BOS, tokenizer, **kwargs): + """Given a set of prompts and number of tokens to generate: + - tokenize prompts + - set the sequence length to be the max of length of prompts + plus the number of tokens we would like to generate + - pad all the sequences to this length so we can convert them + into a 2D tensor. + """ + + # Tokenize all the prompts. 
+    # if add_BOS:
+    #     prompts_tokens = [[tokenizer.bos] + tokenizer.tokenize(prompt)
+    #                       for prompt in prompts]
+    # else:
+    #     prompts_tokens = [tokenizer.tokenize(prompt) for prompt in prompts]
+
+    prompts_tokens = [_tokenize_prompt(prompt, tokenizer, add_BOS, **kwargs) for prompt in prompts]
+
+    # Now we have a list of lists of tokens where each list has a different
+    # size. We want to extend this list to:
+    #   - incorporate the tokens that need to be generated
+    #   - make all the sequences equal length.
+    # Get the prompts length.
+    prompts_length = [len(prompt_tokens) for prompt_tokens in prompts_tokens]
+    # Get the max prompts length.
+    max_prompt_len = max(prompts_length)
+    # Number of tokens in each sample of the batch.
+    samples_length = max_prompt_len + tokens_to_generate
+    # Now update the list of lists to be of the same size: samples_length.
+    for prompt_tokens, prompt_length in zip(prompts_tokens, prompts_length):
+        padding_size = samples_length - prompt_length
+        prompt_tokens.extend([tokenizer.eos_token_id] * padding_size)
+
+    # Now that we are in a structured format, we can convert to tensors.
+    prompts_tokens_tensor = torch.LongTensor(prompts_tokens)
+    prompts_length_tensor = torch.LongTensor(prompts_length)
+    attention_mask = torch.zeros(prompts_tokens_tensor.shape[:2])
+    for i, l in enumerate(prompts_length_tensor):
+        attention_mask[i, :l] = 1
+    return prompts_tokens_tensor, prompts_length_tensor, attention_mask
+
+
+def _tokenize_prompt(prompt, tokenizer, add_BOS=False, media_info={"<image>": 65, "<|video|>": 65}, **kwargs):
+    media_tokens = {k: -int(i + 1) for i, k in enumerate(media_info.keys())}
+    media_lengths = media_info.copy()
+
+    if add_BOS:
+        prompt_chunk = [tokenizer.bos_token_id]
+    else:
+        prompt_chunk = []
+
+    # Pure Text
+    if all([media_token not in prompt for media_token in media_tokens.keys()]):
+        enc_chunk = prompt_chunk + tokenizer(prompt, add_special_tokens=False, **kwargs)["input_ids"]
+
+    # Multi-Modal Text
+    else:
+        enc_chunk = prompt_chunk
+        pattern = "|".join(map(re.escape, list(media_tokens.keys())))
+        chunk_strs = re.split(f"({pattern})", prompt)
+        chunk_strs = [x for x in chunk_strs if len(x) > 0]
+        for idx, chunk_str in enumerate(chunk_strs):
+            if chunk_str in media_tokens:
+                enc_chunk += [media_tokens[chunk_str]] * media_lengths[chunk_str]
+            else:
+                tmp_chunk = tokenizer(chunk_str, add_special_tokens=False)["input_ids"]
+                # if idx < len(chunk_strs) - 1:  # Last chunk should not have eos
+                #     tmp_chunk += [tokenizer.eod_id]
+                enc_chunk += tmp_chunk
+    return enc_chunk
+
+
+if __name__ == "__main__":
+    pass
diff --git a/lmms_eval/models/mplug_owl_video/tokenization_mplug_owl.py b/lmms_eval/models/mplug_owl_video/tokenization_mplug_owl.py
new file mode 100644
index 00000000..22384b44
--- /dev/null
+++ b/lmms_eval/models/mplug_owl_video/tokenization_mplug_owl.py
@@ -0,0 +1,62 @@
+# coding=utf-8
+# Copyright 2022 x-plug and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes for MplugOwl.""" + +from transformers.utils import logging +from transformers.models.llama.tokenization_llama import LlamaTokenizer + + +logger = logging.get_logger(__name__) + +VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "MAGAer13/mplug-owl-llama-7b": "https://huggingface.co/MAGAer13/mplug-owl-llama-7b/resolve/main/vocab.txt", + }, +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "MAGAer13/mplug-owl-llama-7b": 2048, +} + + +class MplugOwlTokenizer(LlamaTokenizer): + def __init__( + self, + vocab_file, + unk_token="", + bos_token="", + eos_token="", + pad_token="", + sp_model_kwargs=None, + add_bos_token=False, + add_eos_token=False, + clean_up_tokenization_spaces=False, + **kwargs, + ): + super().__init__( + vocab_file, + unk_token, + bos_token, + eos_token, + pad_token, + sp_model_kwargs, + add_bos_token, + add_eos_token, + clean_up_tokenization_spaces, + **kwargs, + ) + self.eod_id = self.eos_token_id diff --git a/lmms_eval/models/qwen_vl.py b/lmms_eval/models/qwen_vl.py old mode 100644 new mode 100755 index 4d9cdbb1..e55ad7c9 --- a/lmms_eval/models/qwen_vl.py +++ b/lmms_eval/models/qwen_vl.py @@ -242,12 +242,11 @@ def _collate(x): if len(visual_paths) == 0: for context in contexts: query.append({"text": context}) - else: + else: for visual_path, context in zip(visual_paths, contexts): query.append({"image": visual_path}) query.append({"text": context}) - questions = self.tokenizer.from_list_format(query) input_ids = self.tokenizer(questions, return_tensors="pt", padding="longest") diff --git a/lmms_eval/models/reka.py b/lmms_eval/models/reka.py new file mode 100644 index 00000000..d5e85d5d --- /dev/null +++ b/lmms_eval/models/reka.py @@ -0,0 +1,189 @@ +from PIL import Image +from io import BytesIO +from copy import deepcopy +import numpy as np +import os +import base64 +from typing import List, Tuple +from tqdm import tqdm +import requests as url_requests +import time +import logging +import json + +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from accelerate import Accelerator, DistributedType + +NUM_SECONDS_TO_SLEEP = 30 +eval_logger = logging.getLogger("lmms-eval") + +try: + from reka.client import Reka as RekaClient + from reka import ChatMessage + from decord import VideoReader, cpu +except Exception as e: + eval_logger.error(f"Error importing reka: {e}") + + +@register_model("reka") +class Reka(lmms): + def __init__( + self, + model_version: str = "reka-edge", + modality: str = "image", + max_frames_for_video: int = 10, + timeout: int = 120, + continual_mode: bool = False, + response_persistent_folder: str = None, # We will cache the Gemini API response in this path and use it for future requests + **kwargs, + ) -> None: + super().__init__() + self.model_version = model_version + self.modality = modality + self.max_frames_for_video = max_frames_for_video + self.timeout = timeout + self.continual_mode = continual_mode + if self.continual_mode and response_persistent_folder is None: + raise ValueError("Continual mode requires a persistent path for the response. 
Please provide a valid path.") + self.response_persistent_folder = response_persistent_folder + self.response_persistent_file = os.path.join(self.response_persistent_folder, f"{self.model_version}_response.json") + + if os.path.exists(self.response_persistent_file): + with open(self.response_persistent_file, "r") as f: + self.response_cache = json.load(f) + self.cache_mode = "resume" + else: + self.response_cache = {} + self.cache_mode = "start" + + self.reka = RekaClient(api_key=os.getenv("REKA_API_KEY", "YOUR_API_KEY")) + + accelerator = Accelerator() + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.accelerator = accelerator + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + + self.device = self.accelerator.device + + def encode_image(self, image): + if type(image) == list: + media_urls = [] + for img in image: + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + media_urls.append(f"data:image/jpeg;base64,{base64_str}") + return media_urls + else: + output_buffer = BytesIO() + image.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + + return f"data:image/jpeg;base64,{base64_str}" + + def encode_video(self, video_path): + vr = VideoReader(video_path, ctx=cpu(0)) + total_frame_num = len(vr) + uniform_sampled_frames = np.linspace(0, total_frame_num - 1, self.max_frames_for_video, dtype=int) + frame_idx = uniform_sampled_frames.tolist() + frames = vr.get_batch(frame_idx).asnumpy() + + base64_frames = [] + for frame in frames: + img = Image.fromarray(frame) + output_buffer = BytesIO() + img.save(output_buffer, format="PNG") + byte_data = output_buffer.getvalue() + base64_str = base64.b64encode(byte_data).decode("utf-8") + base64_frames.append(f"data:image/jpeg;base64,{base64_str}") + + return base64_frames + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for context, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + if self.continual_mode is True and self.cache_mode == "resume": + doc_uuid = f"{task}___{split}___{doc_id}" + if doc_uuid in self.response_cache: + response_text = self.response_cache[doc_uuid] + if response_text: + res.append(response_text) + pbar.update(1) + continue + + visual = doc_to_visual(self.task_dict[task][split][doc_id]) + + message_content = [] + + if self.modality == "image": + media_urls = self.encode_image(visual) + message_content.append({"type": "text", "text": context}) + for media_url in media_urls: + message_content.append({"type": "image_url", "image_url": media_url}) + elif self.modality == "video": + message_content.append({"type": "text", "text": context}) + assert len(visual) == 1, "Reka only supports one video per request" + media_urls = self.encode_video(visual[0]) + assert len(media_urls) == 
self.max_frames_for_video, f"Reka only supports {self.max_frames_for_video} frames per request" + for media_url in media_urls: + message_content.append({"type": "image_url", "image_url": media_url}) + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 1024 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + if "top_p" not in gen_kwargs: + gen_kwargs["top_p"] = None + if "num_beams" not in gen_kwargs: + gen_kwargs["num_beams"] = 1 + + for attempt in range(5): + try: + response = self.reka.chat.create( + messages=[ + ChatMessage( + role="user", + content=message_content, + ) + ], + model=self.model_version, + ) + response_text = response.responses[0].message.content.strip() + break # If successful, break out of the loop + + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {str(e)}") + if attempt < 5 - 1: # If we have retries left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log and return empty + eval_logger.error(f"All 5 attempts failed. Last error message: {str(e)}") + response_text = "" + + res.append(response_text) + pbar.update(1) + if self.continual_mode is True: # Cache the response + doc_uuid = f"{task}___{split}___{doc_id}" + self.response_cache[doc_uuid] = response_text + with open(self.response_persistent_file, "w") as f: + json.dump(self.response_cache, f) + + pbar.close() + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + # TODO + assert False, "Reka not support loglikelihood" diff --git a/lmms_eval/models/video_chatgpt.py b/lmms_eval/models/video_chatgpt.py new file mode 100644 index 00000000..a724cd98 --- /dev/null +++ b/lmms_eval/models/video_chatgpt.py @@ -0,0 +1,200 @@ +import os +from lmms_eval import utils +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model + +from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs +from accelerate.state import AcceleratorState +from huggingface_hub import snapshot_download +import torch +from PIL import Image + +from datetime import timedelta +import logging +from typing import List, Tuple, Optional, Union +from tqdm import tqdm + +try: + from lmms_eval.models.video_chatgpt.eval.model_utils import load_video, initialize_model + from lmms_eval.models.video_chatgpt.inference import video_chatgpt_infer, video_chatgpt_infer_ppl, get_spatio_temporal_features_torch +except ImportError: + eval_logger = logging.getLogger("lmms-eval") + eval_logger.info("Failed to import video_chatgpt modules") + +from lmms_eval.models.model_utils.load_video import read_video_pyav + +eval_logger = logging.getLogger("lmms-eval") + + +@register_model("video_chatgpt") +class VideoChatGPT(lmms): + def __init__( + self, + batch_size: Optional[Union[int, str]] = 1, + projection_path: str = "MBZUAI/Video-ChatGPT-7B", + model_path: str = "mmaaz60/LLaVA-7B-Lightening-v1-1", + device_map="cuda:0", + device: Optional[str] = "cuda:0", + num_frm: Optional[Union[int, str]] = 100, + ) -> None: + super().__init__() + self.batch_size_per_gpu = int(batch_size) + self.num_frm = int(num_frm) + accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52)) + accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs]) + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = 
f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + try: + self.model, self.vision_tower, self.tokenizer, self.image_processor, self.video_token_len = initialize_model(model_path, projection_path, device=self.device) + except: + eval_logger.info("Does not find the model from the path you provide, try downloading from the hf repo.") + model_path = snapshot_download(repo_id=model_path) + projection_path = os.path.join(snapshot_download(repo_id=projection_path), "video_chatgpt-7B.bin") + self.model, self.vision_tower, self.tokenizer, self.image_processor, self.video_token_len = initialize_model(model_path, projection_path, device=self.device) + + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + # If you want to use DistributedType.DEEPSPEED, you have to run accelerate config before using the model + # Also, you have to select zero stage 0 (equivalent to DDP) in order to make the prepare model works + # I tried to set different parameters in the kwargs to let default zero 2 stage works, but it didn't work. + if accelerator.distributed_type == DistributedType.DEEPSPEED: + kwargs = { + "train_micro_batch_size_per_gpu": self.batch_size_per_gpu, + "train_batch_size": self.batch_size_per_gpu * accelerator.num_processes, + } + AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs) + eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. 
Make sure you run `accelerate config` and set zero stage to 0") + if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + elif accelerator.num_processes == 1 and device_map == "auto": + eval_logger.info(f"Using {accelerator.num_processes} devices with tensor parallelism") + self._rank = 0 + self._word_size = 1 + else: + eval_logger.info(f"Using single device: {self._device}") + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def generate_until(self, requests) -> List[str]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, gen_kwargs, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + # videos = [] + for visual in visuals: + video_frames = read_video_pyav(visual, num_frm=self.num_frm) + target_h, target_w = 224, 224 + # If image shape is not as target, resize it + if video_frames.shape[-3] != target_h or video_frames.shape[-2] != target_w: + video_frames = torch.from_numpy(video_frames).permute(0, 3, 1, 2).float() + video_frames = torch.nn.functional.interpolate(video_frames, size=(target_h, target_w)) + video_frames = video_frames.permute(0, 2, 3, 1).to(torch.uint8).numpy() + video_frames = [Image.fromarray(frame) for frame in video_frames] + if len(video_frames) > self.num_frm: + video_frames = video_frames[: self.num_frm] + # VideoChatGPT load video return a list of PIL Image + # videos += video_frames + + output = video_chatgpt_infer( + video_frames, contexts, conv_mode="video-chatgpt_v1", model=self.model, vision_tower=self.vision_tower, tokenizer=self.tokenizer, image_processor=self.image_processor, video_token_len=self.video_token_len + ) + + res.append(output) + pbar.update(1) + + return res + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + res = [] + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + + for contexts, doc_to_target, doc_to_visual, doc_id, task, split in [reg.args for reg in requests]: + # encode, pad, and truncate contexts for this batch + if type(doc_to_target) == str: + continuation = doc_to_target + else: + continuation = doc_to_target(self.task_dict[task][split][doc_id]) + visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] + visuals = self.flatten(visuals) + videos = [] + for visual in visuals: + video_frames = load_video(visual, num_frm=self.num_frm) + # VideoChatGPT load video return a list of PIL Image + videos += video_frames + image_tensor = self.image_processor.preprocess(videos, return_tensors="pt")["pixel_values"] + + # Move image tensor to GPU and reduce precision to half + image_tensor = image_tensor.half().to(self.device) + + # Generate video spatio-temporal features + with torch.no_grad(): + image_forward_outs = 
self.vision_tower(image_tensor, output_hidden_states=True) + frame_features = image_forward_outs.hidden_states[-2][:, 1:] # Use second to last layer as in LLaVA + video_spatio_temporal_features = get_spatio_temporal_features_torch(frame_features).cuda() + + outputs, input_ids, context_ids = video_chatgpt_infer_ppl( + # video_frames, + contexts, + continuation, + conv_mode="video-chatgpt_v1", + model=self.model, + vision_tower=self.vision_tower, + tokenizer=self.tokenizer, + image_processor=self.image_processor, + video_token_len=self.video_token_len, + video_spatio_temporal_features=video_spatio_temporal_features, + ) + + loss = outputs["loss"] + # loss = torch.exp(loss) + logits = outputs["logits"] + greedy_tokens = logits.argmax(dim=-1) + cont_toks = input_ids[:, context_ids.shape[1] :] # [1, seq] + greedy_tokens = greedy_tokens[:, context_ids.shape[1] : input_ids.shape[1]] # [1, seq] + max_equal = (greedy_tokens == cont_toks).all() + res.append((float(loss.item()), bool(max_equal))) + pbar.update(1) + pbar.close() + return res + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size diff --git a/lmms_eval/models/video_chatgpt/__init__.py b/lmms_eval/models/video_chatgpt/__init__.py new file mode 100644 index 00000000..c5f48379 --- /dev/null +++ b/lmms_eval/models/video_chatgpt/__init__.py @@ -0,0 +1 @@ +from .model import VideoChatGPTLlamaForCausalLM diff --git a/lmms_eval/models/video_chatgpt/constants.py b/lmms_eval/models/video_chatgpt/constants.py new file mode 100644 index 00000000..c9ea9ac1 --- /dev/null +++ b/lmms_eval/models/video_chatgpt/constants.py @@ -0,0 +1,11 @@ +CONTROLLER_HEART_BEAT_EXPIRATION = 30 +WORKER_HEART_BEAT_INTERVAL = 15 + +LOGDIR = "." + + +# Defining model +DEFAULT_VIDEO_TOKEN = "