[Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval #108

Merged
merged 208 commits into from
Jun 12, 2024
2c03baf
Squashed commits from internal development repo
Apr 16, 2024
67ea2bd
[Feat] Add Video-Llava, Llama-vid, X-composer-HD, llava-wilder (Cherr…
kcz358 Apr 18, 2024
50959b4
Refactor code to handle customized configuration and attention implem…
Apr 18, 2024
8bbda22
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
Apr 18, 2024
38f1806
Fix debug log message in utils.py
Apr 18, 2024
d2d66b2
[Feat & Fix] Add Claude, Qwen_VL_api; Fix xcomposer_4khd generation t…
kcz358 Apr 20, 2024
910015e
Import sympy package and handle ImportError in OlympiadBenchEvaluator
Apr 20, 2024
49e9035
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
Apr 20, 2024
3cea658
Add shot examples and update create_one_query function
Luodian Apr 21, 2024
ba2fd6d
Remove redundant code in mathvista_doc_to_text function
Luodian Apr 21, 2024
21fe64a
Fix error loading OlympiadBenchEvaluator from cn_utils and en_utils
Luodian Apr 21, 2024
4153703
Update mathvista_testmini.yaml configuration
Luodian Apr 21, 2024
1f7e419
Add auto strip for all task
kcz358 Apr 22, 2024
d3febcf
Fix whitespace issue in process_results method
Luodian Apr 23, 2024
c170f55
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
Luodian Apr 23, 2024
ea063fc
Add SimpleMultiChoiceRegexFilter class for AI2D/RealworldQA dataset
Luodian Apr 24, 2024
8159d56
[Merge] git checkout --patch sglang form public (#80)
kcz358 Apr 26, 2024
a601276
Update lmms_eval/models/__init__.py and lmms_eval/api/samplers.py
Luodian Apr 26, 2024
a862487
Refactor output path creation in cli_evaluate_single function
Luodian Apr 29, 2024
4bcd002
updates
Apr 29, 2024
3022287
Fix log_samples suffix length
Luodian Apr 30, 2024
05666bf
add idefics2
Luodian Apr 30, 2024
dcc67c4
Add video encoding functionality and update prompt for video modality
Luodian Apr 30, 2024
b934ffd
Add Accelerator and DistributedType imports***
Luodian May 1, 2024
582fd44
[Feat & Fix]Add GPT4V for video and fix the import issue for older ve…
Luodian May 1, 2024
b82e693
Add Gemini API/Model and update filtering logic for worldqa
Luodian May 1, 2024
d64be02
Add 'decord' to dependencies in pyproject.toml
Luodian May 1, 2024
6377474
Update default value for log_samples_suffix
Luodian May 1, 2024
324c4e5
Merge branch 'internal_main_dev' into feat/gemini_worldqa
Luodian May 1, 2024
148eb9a
Merge branches 'feat/gemini_worldqa' and 'feat/gemini_worldqa' of htt…
Luodian May 1, 2024
ee31b42
Update gemini_input_content example_id to use doc_id
Luodian May 1, 2024
f49d35d
Add worldqa gpt eval
kcz358 May 2, 2024
b4e15d2
Add gpt eval in metric list
kcz358 May 2, 2024
656184a
Add model_name parameter to Llava constructor
Luodian May 2, 2024
65abc44
[WIP] Add Gemini interface for worldqa, update worldqa's filtering lo…
Luodian May 2, 2024
ee3fced
update
Luodian May 3, 2024
bce39fe
add InternVL_chat_1.5b model for lmmseval
choiszt May 3, 2024
88614ee
update
Luodian May 4, 2024
7b40550
Update generation_kwargs in YAML files and save images in JPEG format
Luodian May 4, 2024
18ca590
Group MMMU images into one image (#83)
pufanyi May 4, 2024
9ee68f6
Fix formatting and typos in lmms_eval code
Luodian May 4, 2024
140c8a1
update setup and bash command for evaluating InternVL
choiszt May 5, 2024
64b5638
Snapshot download for internvl ckpt
kcz358 May 6, 2024
eec83c3
Update prompt in utils.py
kcz358 May 6, 2024
b8e31d8
update (#85)
Luodian May 6, 2024
e19e7f6
update setup info
choiszt May 6, 2024
43f5276
Let gen_kwargs decide gen_config
kcz358 May 7, 2024
fb75b40
Merge branch 'internal_main_dev' into choiszt/internvl
Luodian May 7, 2024
c29d494
Revert "remove internvl subproject"
Luodian May 7, 2024
4da72eb
Revert "[Feat] add intern-vl (#87)"
Luodian May 7, 2024
bb7d903
No error for internvl if internvl not installed
kcz358 May 8, 2024
ef439e1
Change split of small
kcz358 May 8, 2024
914eb41
Change judge prompt of wilder
kcz358 May 8, 2024
181672a
Merge branch 'choiszt/internvl' of https://github.com/EvolvingLMMs-La…
kcz358 May 8, 2024
cf31499
Merge branch 'choiszt/internvl' into internal_main_dev
kcz358 May 8, 2024
4a9de44
Remove unnecessary items
kcz358 May 8, 2024
2ad1e14
Fix gpt4v one process bugs
kcz358 May 8, 2024
51c6a6d
Update gitignore
kcz358 May 9, 2024
aabb964
Add video detail description task and template YAML files
ZhangYuanhan-AI May 12, 2024
a55eb80
Update llava imports and add mm_resampler_location
Luodian May 12, 2024
5bd536c
convert contexts to list if necessary and remove unnecessary construc…
tupini07 Apr 22, 2024
c58f941
refactor query construction for clarity
tupini07 Apr 22, 2024
8b9b907
Add default prompt for xcomposer
kcz358 May 14, 2024
302fed6
Better task list_with_num
kcz358 May 15, 2024
1fd9007
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
kcz358 May 15, 2024
896f881
Fix idefics2 llava in the wild bugs
kcz358 May 16, 2024
24764f7
Remove redundant code in fuyu
kcz358 May 16, 2024
9442017
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
kcz358 May 16, 2024
c73e93a
Fix instructblip qformer size mismatch and multi-images problem
kcz358 May 16, 2024
06e4981
[WIP] add video detailed description dataset, merged live bench (#90)
Luodian May 16, 2024
83f1dca
Better livebench process result
kcz358 May 19, 2024
1bf0c17
Add subtask field to gpt4_eval_score (#91)
pufanyi May 19, 2024
acf69e2
Add default max new tokens to idefics2
kcz358 May 20, 2024
a301fbf
Comment out parse result in xcomposer
kcz358 May 20, 2024
4d00896
Comment out Spice in caption task so that don't need to download stan…
kcz358 May 20, 2024
0c2a8af
Update LiveBench Eval Prompt (#92)
pufanyi May 23, 2024
e263d01
Change chinese ' to english ' in mix_evals
kcz358 May 23, 2024
e2b90ef
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
kcz358 May 23, 2024
cb2ec97
Remove register qwen config in llava_vid fornow
kcz358 May 24, 2024
0863814
[WIP] adding video datasets (#93)
Luodian May 24, 2024
b55fa9a
Gemini & Claude & NExT-QA (#95)
pufanyi May 24, 2024
254feeb
Update GeminiAPI class to support continual mode and cache API responses
Luodian May 25, 2024
73b6d4e
Add review field to evaluated results
Luodian May 25, 2024
f1f86e6
Add delay before returning uploaded object in encode_video method
pufanyi May 26, 2024
bc4b541
Set llava vid compatible to public
kcz358 May 26, 2024
174d8e1
[Feat] add gemini api for video and continual mode to gemini api mode…
Luodian May 26, 2024
a8b2c88
Unified the error log in the task, set to debug
kcz358 May 26, 2024
312c160
Make sure unzip only happens in main process
kcz358 May 26, 2024
82b1cd5
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
kcz358 May 26, 2024
00da77e
Complete the implementation of CVRR video dataset (#97)
KairuiHu May 26, 2024
965cc5b
dont cache failed content
pufanyi May 26, 2024
6e082f6
Merge remote-tracking branch 'origin/internal_main_dev' into pufanyi/…
pufanyi May 26, 2024
e185640
Fix content generation in GeminiAPI
pufanyi May 26, 2024
a9cc36f
safty settings
pufanyi May 26, 2024
ce67461
Fix HarmBlockThreshold import in GeminiAPI
pufanyi May 26, 2024
bb09ad4
Fix harmful content generation and update logging configuration
pufanyi May 26, 2024
15de067
fix bug in continous mode
pufanyi May 26, 2024
2ec645d
Refactor GeminiAPI class to improve cache handling
pufanyi May 26, 2024
a474f22
Fix issue with clearing content variable in GeminiAPI class
pufanyi May 26, 2024
99518f4
Refactor exception handling in GeminiAPI class
pufanyi May 26, 2024
bd908ef
Update safety settings in GeminiAPI.generate_content() method
pufanyi May 27, 2024
b2d5df1
optimize query in gemini
pufanyi May 27, 2024
6148e1e
Merge pull request #98 from EvolvingLMMs-Lab/pufanyi/fix_gemini
pufanyi May 27, 2024
128a76e
Add load video in model_utils
kcz358 May 27, 2024
fa32668
Change load video to pyav in videoChatGPT
kcz358 May 27, 2024
80fbf66
Fix the bug of file overwrite when running multiple video datasets (#99)
KairuiHu May 27, 2024
fa1224f
Revise gemini api import
kcz358 May 27, 2024
0a89fb2
Change video_llava to use transformers instead build from source
kcz358 May 27, 2024
212e396
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
kcz358 May 27, 2024
6728dec
Optimize the inference for videochatgpt: Now the 3 tasks can inferenc…
KairuiHu May 27, 2024
25fccff
Change llama_vid to use pyav to load video
kcz358 May 27, 2024
b11faab
Fix some small bugs (#101)
pufanyi May 28, 2024
48a3c1b
Rename file name for video evaluation inference results (#102)
KairuiHu May 28, 2024
494795d
Add original load_video to llama_vid if we need to use this in the fu…
kcz358 May 28, 2024
670ad7a
Add pyav load video option to llava_vid
kcz358 May 28, 2024
5e9b244
Better indices calculating for load video using pyav
kcz358 May 28, 2024
d1715b3
Add tqdm for gpt eval and remove unused import
kcz358 May 28, 2024
21c8341
Group the tasks for videochatgpt and cvrr (#103)
KairuiHu May 28, 2024
8fcf639
Add .webm support for loading videos
kcz358 May 28, 2024
92e977c
NExTQA-MC (#104)
pufanyi May 29, 2024
bdfc8a8
Optimize dataset caching and download process
Luodian May 29, 2024
f154af3
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
Luodian May 29, 2024
30e1889
Fix the bug of videochatgpt_consistency in which pred1 and pred2 are …
KairuiHu May 30, 2024
06fff45
[WIP] Refactor activitynetqa_generation for correct score parsing. (#…
Luodian May 30, 2024
9dbbdbf
Revert "[WIP] Refactor activitynetqa_generation for correct score par…
Luodian May 30, 2024
c136e77
[Reka and Fix] move gpt eval to process_results stage. (#108)
Luodian May 31, 2024
1e9c722
chore: Refactor video_detail_description aggregation function name
Luodian May 31, 2024
39ec1ad
chore: Add pywsd dependency for improved word sense disambiguation
Luodian May 31, 2024
a7fbf5b
remove converstion to accuracy in generic pipeline, leave the logic b…
Luodian May 31, 2024
37b2e1b
NextQA-OE Align (#109)
pufanyi May 31, 2024
0ddf03b
Refactor videochatgpt and cvrr to correct score aggregation, and also…
KairuiHu Jun 3, 2024
bbe9ab0
YouCook2 (#112)
pufanyi Jun 3, 2024
722f6d6
PerceptionTest MCQA(test), mc and mc_ppl enabled (#113)
KairuiHu Jun 4, 2024
c3b2ef4
FromLog Model (#114)
pufanyi Jun 6, 2024
955df97
Enable PerceptionTest-Val (for scoring) and TempCompass-MC dataset (#…
KairuiHu Jun 8, 2024
8a2d439
Complete the implementation of TempCompass dataset and alignment by v…
KairuiHu Jun 8, 2024
cb5e71f
Update README.md
Luodian Jun 8, 2024
a81e30b
[WIP] check video datasets, in progress (#110)
Luodian Jun 9, 2024
4addd8b
Refactor exclusion patterns in pyproject.toml and tool.wheel
Luodian Jun 9, 2024
d10e5ca
Revise pyav load stream logic
kcz358 Jun 9, 2024
0a7d8f2
Add tools to calculate avg time in the video dataset
kcz358 Jun 9, 2024
3a4260c
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
kcz358 Jun 9, 2024
105fb25
update VATEX datasets (#118)
choiszt Jun 9, 2024
e4ac135
add download logic to support force download option
Luodian Jun 9, 2024
15df0b9
Hardcode the loading logic for v1.5 in llavavid, prevent loading from…
kcz358 Jun 10, 2024
b7a8d52
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
kcz358 Jun 10, 2024
579f537
Add videochatgpt inference file
kcz358 Jun 10, 2024
ef91210
Add mplug Owl video
kcz358 Jun 10, 2024
7110578
Add necessary file for mplug Owl video to load
kcz358 Jun 10, 2024
b421c16
Add mplug owl to model __init__ file
kcz358 Jun 10, 2024
8d2fc90
Pop out force_download in load dataset, otherwise a lot error
kcz358 Jun 10, 2024
809a3a9
chore: Update dataset paths and dependencies
Luodian Jun 10, 2024
be30fc8
Update
Luodian Jun 10, 2024
844c734
chore: Update dataset paths and dependencies
Luodian Jun 10, 2024
f905a51
chore: Remove unused VATEX validation configuration file
Luodian Jun 10, 2024
a000a60
Align CVRR and Tempcompass, Align GPT_Eval model name, Align Tempcomp…
KairuiHu Jun 10, 2024
0c1350d
add Video-MME (#119)
choiszt Jun 10, 2024
b675cee
More robust get time for video
kcz358 Jun 10, 2024
55d6a36
Remove os.chdir in mmmu group image
kcz358 Jun 10, 2024
77a7d8e
Add matching rule for egoschema generation task (mplug cannot follow …
KairuiHu Jun 11, 2024
1903fd4
Use data packet to load video in pyav
kcz358 Jun 11, 2024
327dc06
chore: Update task name in egoschema_mcppl.yaml
Luodian Jun 11, 2024
c8a67f1
Better load video utils
kcz358 Jun 11, 2024
0a589d6
chore: Update video_decode_backend to "decord" in LlavaVid class
Luodian Jun 11, 2024
bcd9326
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
Luodian Jun 11, 2024
5cd09d0
Add readme to EgoSchema (#122)
KairuiHu Jun 11, 2024
0995342
Update vatex_test and vatex_val_zh (#123)
choiszt Jun 11, 2024
1a58ef8
chore: Update nltk dependencies and download missing resources
Luodian Jun 11, 2024
faf6c17
chore: Update nltk dependencies and download missing resources
Luodian Jun 11, 2024
7e53471
Delete 1.txt
choiszt Jun 11, 2024
b681392
Delete utils directory
choiszt Jun 11, 2024
595a1b1
Delete tools/make_vatex.py
choiszt Jun 11, 2024
f1149c5
Delete lmms_eval/models/mini_gemini.py
choiszt Jun 11, 2024
bf18c40
Delete lmms_eval/tasks/vatex/vatex_val_ZH.yaml
choiszt Jun 11, 2024
b995e13
Update utils.py
choiszt Jun 11, 2024
16de2dd
Update videomme.yaml
choiszt Jun 11, 2024
4a32971
chore: Update dataset paths and dependencies
Luodian Jun 11, 2024
8f2c48a
chore: Remove force_unzip flag in vatex_val_zh.yaml
Luodian Jun 11, 2024
5448286
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
Luodian Jun 11, 2024
96cbe16
Update vatex_val_zh.yaml
choiszt Jun 11, 2024
699d7fd
chore: Update prompt for video caption in vatex_test.yaml
Luodian Jun 11, 2024
2c8b03e
chore: Update nltk installation and downloads in utils.py
pufanyi Jun 11, 2024
494db6a
lint
pufanyi Jun 11, 2024
ce44672
Update video_decode_backend to "pyav" in llava_vid.py
Luodian Jun 11, 2024
3bed7b9
Merge pull request #125 from EvolvingLMMs-Lab/pufanyi/nextqa-nltk-dow…
KairuiHu Jun 11, 2024
1b0df89
Add predict only args in main
kcz358 Jun 12, 2024
d3dd9e1
Add bypass metric
kcz358 Jun 12, 2024
1d788af
Add override metric
kcz358 Jun 12, 2024
4a60f2a
Add predict only for evaluation
kcz358 Jun 12, 2024
11bbf6a
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
kcz358 Jun 12, 2024
12517e7
chore: Update gitignore and import error message in mplug_owl_video
Luodian Jun 12, 2024
c6319ba
Merge branch 'internal_main_dev' of https://github.com/EvolvingLMMs-L…
Luodian Jun 12, 2024
f6e962f
Update videomme.yaml
choiszt Jun 12, 2024
07bfd49
Add worldqa evaluator
kcz358 May 6, 2024
efd80c2
Fix worldqa specific args default prompt
kcz358 May 6, 2024
37f39b5
Add question in results for mc
kcz358 May 6, 2024
6cf1ded
Delete redundant print, and lint again to reformat (#126)
KairuiHu Jun 12, 2024
12f6de2
chore: Remove unnecessary dataset kwargs in ConfigurableTask
Luodian Jun 12, 2024
4a5990e
Add higher is better to Submission metric to avoid warning (#127)
KairuiHu Jun 12, 2024
b851ab9
Fix load video error for mp4 with packet
kcz358 Jun 12, 2024
e43bd84
chore: Remove unnecessary files and code related to live_bench and sf…
Luodian Jun 12, 2024
465bd42
Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval …
Luodian Jun 12, 2024
c9b2252
Bump version to 0.2.0.dev0
Luodian Jun 12, 2024
50575a9
chore: Update lmms-eval to support video evaluations for LLaVA models
Luodian Jun 12, 2024
3415633
Update llava conv_template in lmms_eval/models/llava.py
Luodian Jun 12, 2024
f00d549
Update image alignment in README.md
Luodian Jun 12, 2024
cbeee20
chore: Update lmms-eval to support video evaluations for LLaVA models
Luodian Jun 12, 2024
05dc8e8
chore: Update lmms-eval to support video evaluations for LLaVA models
Luodian Jun 12, 2024
Empty file modified .github/issue_template.md
100644 → 100755
Empty file.
Empty file modified .github/pull_request_template.md
100644 → 100755
Empty file.
Empty file modified .github/workflows/black.yml
100644 → 100755
Empty file.
8 changes: 8 additions & 0 deletions .gitignore
100644 → 100755
@@ -29,3 +29,11 @@ ckpt
 pretrained/
 LLaVA/
 *logs
+temp/
+InternVL/
+logs/
+data/
+llava-video/
+Video-MME/
+VATEX/
+lmms_eval/tasks/vatex/__pycache__/utils.cpython-310.pyc
Empty file modified .pre-commit-config.yaml
100644 → 100755
Empty file.
400 changes: 167 additions & 233 deletions README.md
100644 → 100755

Large diffs are not rendered by default.

Empty file modified docs/README.md
100644 → 100755
Empty file.
Empty file modified docs/commands.md
100644 → 100755
Empty file.
122 changes: 122 additions & 0 deletions docs/current_tasks.md
@@ -0,0 +1,122 @@
# Current Tasks

> () indicates the task name in the lmms_eval. The task name is also used to specify the dataset in the configuration file.
> The following is manually updated documentation. You could use `lmms_eval task --list` to list all supported tasks and their task names.

- AI2D (ai2d)
- ChartQA (chartqa)
- CMMMU (cmmmu)
- CMMMU Validation (cmmmu_val)
- CMMMU Test (cmmmu_test)
- COCO Caption (coco_cap)
- COCO 2014 Caption (coco2014_cap)
- COCO 2014 Caption Validation (coco2014_cap_val)
- COCO 2014 Caption Test (coco2014_cap_test)
- COCO 2017 Caption (coco2017_cap)
- COCO 2017 Caption MiniVal (coco2017_cap_val)
- COCO 2017 Caption MiniTest (coco2017_cap_test)
- [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench)
- DOCVQA (docvqa)
- DOCVQA Validation (docvqa_val)
- DOCVQA Test (docvqa_test)
- Ferret (ferret)
- Flickr30K (flickr30k)
- Ferret Test (ferret_test)
- GQA (gqa)
- HallusionBenchmark (hallusion_bench_image)
- Infographic VQA (info_vqa)
- Infographic VQA Validation (info_vqa_val)
- Infographic VQA Test (info_vqa_test)
- LLaVA-Bench (llava_in_the_wild)
- LLaVA-Bench-COCO (llava_bench_coco)
- MathVerse (mathverse)
- MathVerse Text Dominant (mathverse_testmini_text_dominant)
- MathVerse Text Only (mathverse_testmini_text_only)
- MathVerse Text Lite (mathverse_testmini_text_lite)
- MathVerse Vision Dominant (mathverse_testmini_vision_dominant)
- MathVerse Vision Intensive (mathverse_testmini_vision_intensive)
- MathVerse Vision Only (mathverse_testmini_vision_only)
- MathVista (mathvista)
- MathVista Validation (mathvista_testmini)
- MathVista Test (mathvista_test)
- MMBench (mmbench)
- MMBench English (mmbench_en)
- MMBench English Dev (mmbench_en_dev)
- MMBench English Test (mmbench_en_test)
- MMBench Chinese (mmbench_cn)
- MMBench Chinese Dev (mmbench_cn_dev)
- MMBench Chinese Test (mmbench_cn_test)
- MME (mme)
- MMMU (mmmu)
- MMMU Validation (mmmu_val)
- MMMU Test (mmmu_test)
- MMUPD (mmupd)
- MMUPD Base (mmupd_base)
- MMAAD Base (mmaad_base)
- MMIASD Base (mmiasd_base)
- MMIVQD Base (mmivqd_base)
- MMUPD Option (mmupd_option)
- MMAAD Option (mmaad_option)
- MMIASD Option (mmiasd_option)
- MMIVQD Option (mmivqd_option)
- MMUPD Instruction (mmupd_instruction)
- MMAAD Instruction (mmaad_instruction)
- MMIASD Instruction (mmiasd_instruction)
- MMIVQD Instruction (mmivqd_instruction)
- MMVet (mmvet)
- Multi-DocVQA (multidocvqa)
- Multi-DocVQA Validation (multidocvqa_val)
- Multi-DocVQA Test (multidocvqa_test)
- NoCaps (nocaps)
- NoCaps Validation (nocaps_val)
- NoCaps Test (nocaps_test)
- OKVQA (ok_vqa)
- OKVQA Validation 2014 (ok_vqa_val2014)
- POPE (pope)
- RefCOCO (refcoco)
- refcoco_seg_test
- refcoco_seg_val
- refcoco_seg_testA
- refcoco_seg_testB
- refcoco_bbox_test
- refcoco_bbox_val
- refcoco_bbox_testA
- refcoco_bbox_testB
- RefCOCO+ (refcoco+)
- refcoco+_seg
- refcoco+_seg_val
- refcoco+_seg_testA
- refcoco+_seg_testB
- refcoco+_bbox
- refcoco+_bbox_val
- refcoco+_bbox_testA
- refcoco+_bbox_testB
- RefCOCOg (refcocog)
- refcocog_seg_test
- refcocog_seg_val
- refcocog_bbox_test
- refcocog_bbox_val
- ScienceQA (scienceqa_full)
- ScienceQA Full (scienceqa)
- ScienceQA IMG (scienceqa_img)
- ScreenSpot (screenspot)
- ScreenSpot REC / Grounding (screenspot_rec)
- ScreenSpot REG / Instruction Generation (screenspot_reg)
- SeedBench (seedbench)
- SeedBench 2 (seedbench_2)
- ST-VQA (stvqa)
- TextCaps (textcaps)
- TextCaps Validation (textcaps_val)
- TextCaps Test (textcaps_test)
- TextVQA (textvqa)
- TextVQA Validation (textvqa_val)
- TextVQA Test (textvqa_test)
- VizWizVQA (vizwiz_vqa)
- VizWizVQA Validation (vizwiz_vqa_val)
- VizWizVQA Test (vizwiz_vqa_test)
- VQAv2 (vqav2)
- VQAv2 Validation (vqav2_val)
- VQAv2 Test (vqav2_test)
- WebSRC (websrc)
- WebSRC Validation (websrc_val)
- WebSRC Test (websrc_test)
Empty file modified docs/model_guide.md
100644 → 100755
Empty file.
2 changes: 1 addition & 1 deletion docs/task_guide.md
100644 → 100755
@@ -27,7 +27,7 @@ doc_to_target: "answer"
 generation_kwargs:
   max_new_tokens: 16
   temperature: 0
-  top_p: 0
+  top_p: 1.0
   num_beams: 1
   do_sample: false
 # The return value of process_results will be used by metrics
15 changes: 0 additions & 15 deletions example_eval.yaml

This file was deleted.

Empty file modified lmms_eval/__init__.py
100644 → 100755
Empty file.
24 changes: 20 additions & 4 deletions lmms_eval/__main__.py
100644 → 100755
@@ -106,9 +106,16 @@ def parse_eval_args() -> argparse.Namespace:
     parser.add_argument(
         "--log_samples_suffix",
         type=str,
-        default="",
+        default="model_outputs",
         help="Specify a suffix for the log_samples file name.",
     )
+    parser.add_argument(
+        "--predict_only",
+        "-x",
+        action="store_true",
+        default=False,
+        help="Use with --log_samples. Only model outputs will be saved and metrics will not be evaluated.",
+    )
     parser.add_argument(
         "--show_config",
         action="store_true",
@@ -228,6 +235,10 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:

     initialize_tasks(args.verbosity)

+    if args.predict_only:
+        args.log_samples = True
+    if (args.log_samples or args.predict_only) and not args.output_path:
+        raise ValueError("Specify --output_path if providing --log_samples or --predict_only")
     if args.limit:
         eval_logger.warning(" --limit SHOULD ONLY BE USED FOR TESTING." "REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.")
     if args.include_path is not None:
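The interaction between `--predict_only`, `--log_samples`, and `--output_path` in the two hunks above can be sketched in isolation. This is a minimal sketch, not the project's parser: `build_parser` and `normalize_args` are hypothetical helper names, only three flags are declared, and the `--output_path` and `--log_samples` definitions are assumptions about how those flags are declared elsewhere in `__main__.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: declares only the flags this hunk touches.
    parser = argparse.ArgumentParser()
    # Assumed definitions; the real parser declares these elsewhere.
    parser.add_argument("--output_path", type=str, default=None)
    parser.add_argument("--log_samples", action="store_true", default=False)
    # Copied from the diff: short alias -x, off by default.
    parser.add_argument("--predict_only", "-x", action="store_true", default=False)
    return parser


def normalize_args(args: argparse.Namespace) -> argparse.Namespace:
    # Hypothetical helper mirroring the checks added to cli_evaluate_single.
    if args.predict_only:
        # Predict-only runs always log samples, since metrics are skipped.
        args.log_samples = True
    if (args.log_samples or args.predict_only) and not args.output_path:
        raise ValueError("Specify --output_path if providing --log_samples or --predict_only")
    return args
```

With this sketch, parsing `["-x", "--output_path", "out"]` yields `log_samples=True`, while `["--predict_only"]` alone raises `ValueError` because there is nowhere to write the model outputs.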
@@ -274,6 +285,10 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
     # set datetime before evaluation
     datetime_str = utils.get_datetime_str(timezone=args.timezone)
     if args.output_path:
+        if args.log_samples_suffix and len(args.log_samples_suffix) > 15:
+            eval_logger.warning("The suffix for log_samples is too long. It is recommended to keep it under 15 characters.")
+            args.log_samples_suffix = args.log_samples_suffix[:5] + "..." + args.log_samples_suffix[-5:]
+
         hash_input = f"{args.model_args}".encode("utf-8")
         hash_output = hashlib.sha256(hash_input).hexdigest()[:6]
         path = Path(args.output_path)
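The suffix truncation and the short model-args hash in the hunk above can be tried standalone. A minimal sketch, assuming nothing beyond what the diff shows: `shorten_suffix` and `model_args_hash` are hypothetical names that wrap the same expressions, and the sample strings in the usage note are made up for illustration.

```python
import hashlib


def shorten_suffix(suffix: str, max_len: int = 15) -> str:
    # Hypothetical helper: same rule as the diff, first five and last five
    # characters joined by "...", applied only when the suffix exceeds max_len.
    if suffix and len(suffix) > max_len:
        return suffix[:5] + "..." + suffix[-5:]
    return suffix


def model_args_hash(model_args: str) -> str:
    # Hypothetical helper: six hex characters of SHA-256 over the model args
    # string, as computed in the diff for the output directory name.
    return hashlib.sha256(f"{model_args}".encode("utf-8")).hexdigest()[:6]
```

The default suffix `model_outputs` is 13 characters and passes through unchanged, while a longer value such as `a_very_long_suffix_name` becomes `a_ver..._name`. A truncated suffix is always 13 characters, so it stays under the 15-character limit mentioned in the warning.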
@@ -296,6 +311,7 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
         log_samples=args.log_samples,
         gen_kwargs=args.gen_kwargs,
         cli_args=args,
+        predict_only=args.predict_only,
     )

     if results is not None:
@@ -318,9 +334,9 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
             for task_name, config in results["configs"].items():
                 filename = args.output_path.joinpath(f"{task_name}.json")
                 # Structure the data with 'args' and 'logs' keys
-                data_to_dump = {"args": vars(args), "model_configs": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"])} # Convert Namespace to dict
-                samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable)
-                filename.open("w").write(samples_dumped)
+                data_to_dump = {"args": vars(args), "model_configs": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"]), "time": datetime_str}
+                samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable, ensure_ascii=False)
+                filename.open("w", encoding="utf-8").write(samples_dumped)
                 eval_logger.info(f"Saved samples to {filename}")

         return results, samples
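The last `__main__.py` hunk switches the sample dump to `ensure_ascii=False` with an explicit UTF-8 file encoding, which keeps non-ASCII model outputs readable in the JSON file instead of `\uXXXX` escapes. A minimal sketch of the effect follows; `dump_samples` is a hypothetical helper, `default=str` stands in for the project's `_handle_non_serializable`, and the sketch uses a `with` block where the diff calls `filename.open(...).write(...)` directly.

```python
import json
import tempfile
from pathlib import Path


def dump_samples(path: Path, data: dict) -> None:
    # Hypothetical helper. default=str is a stand-in for the project's
    # _handle_non_serializable callback, which is not shown in this diff.
    text = json.dumps(data, indent=4, default=str, ensure_ascii=False)
    # Explicit UTF-8 so the result does not depend on the platform locale.
    with path.open("w", encoding="utf-8") as f:
        f.write(text)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "example_task.json"
        dump_samples(out, {"logs": [{"doc_id": 0, "resps": "你好"}]})
        # The raw file contains the characters themselves, not escapes.
        print(out.read_text(encoding="utf-8"))
```

Passing `encoding="utf-8"` matters once `ensure_ascii=False` is set: without it, writing non-ASCII text can fail on systems whose default encoding cannot represent those characters.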
Empty file modified lmms_eval/api/__init__.py
100644 → 100755
Empty file.
Empty file modified lmms_eval/api/filter.py
100644 → 100755
Empty file.
Empty file modified lmms_eval/api/instance.py
100644 → 100755
Empty file.
15 changes: 15 additions & 0 deletions lmms_eval/api/metrics.py
100644 → 100755
@@ -16,6 +16,11 @@


 # Register Aggregations First
+@register_aggregation("bypass")
+def bypass_agg(arr):
+    return 999
+
+
 @register_aggregation("mean")
 def mean(arr):
     return sum(arr) / len(arr)
@@ -226,6 +231,16 @@ def mean_stderr(arr):
     return sample_stddev(arr) / math.sqrt(len(arr))


+@register_metric(
+    metric="bypass",
+    higher_is_better=True,
+    output_type=["loglikelihood", "multiple_choice", "generate_until"],
+    aggregation="bypass",
+)
+def bypass(items):
+    return items
+
+
 @register_metric(
     metric="mcc",
     higher_is_better=True,
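The `bypass` metric and aggregation added above act as placeholders: the metric passes items through untouched, and the aggregation returns the sentinel `999` rather than a real score. The registration pattern can be sketched with stand-in registries. The decorators below are simplified assumptions, not the project's implementations; in particular, this `register_metric` stores only the function and ignores `higher_is_better` and `output_type`.

```python
from typing import Callable, Dict

# Stand-in registries; the real ones live in lmms_eval/api/registry.py.
AGGREGATION_REGISTRY: Dict[str, Callable] = {}
METRIC_REGISTRY: Dict[str, Callable] = {}


def register_aggregation(name: str):
    def decorate(fn):
        # Same guard as the project code shown in the registry.py diff.
        assert name not in AGGREGATION_REGISTRY, f"aggregation named '{name}' conflicts with existing registered aggregation!"
        AGGREGATION_REGISTRY[name] = fn
        return fn

    return decorate


def register_metric(metric: str, aggregation: str, **_ignored):
    # Simplified assumption: the real decorator also records metadata such as
    # higher_is_better and output_type.
    def decorate(fn):
        METRIC_REGISTRY[metric] = fn
        return fn

    return decorate


@register_aggregation("bypass")
def bypass_agg(arr):
    # Sentinel value: signals that no real score was computed.
    return 999


@register_metric(metric="bypass", aggregation="bypass", higher_is_better=True)
def bypass(items):
    # Pass per-sample results through unchanged.
    return items
```

Because the aggregate is a constant, a reported score of `999` for a task should be read as "metric skipped", not as a measurement.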
Empty file modified lmms_eval/api/model.py
100644 → 100755
Empty file.
18 changes: 18 additions & 0 deletions lmms_eval/api/registry.py
100644 → 100755
@@ -1,6 +1,8 @@
 from lmms_eval.api.model import lmms

+from typing import Callable, Dict
 import logging
+import evaluate as hf_evaluate

 eval_logger = logging.getLogger("lmms-eval")

@@ -104,6 +106,22 @@ def decorate(fn):
     return decorate


+def get_metric(name: str, hf_evaluate_metric=False) -> Callable:
+    if not hf_evaluate_metric:
+        if name in METRIC_REGISTRY:
+            return METRIC_REGISTRY[name]
+        else:
+            eval_logger.warning(f"Could not find registered metric '{name}' in lm-eval, searching in HF Evaluate library...")
+
+    try:
+        metric_object = hf_evaluate.load(name)
+        return metric_object.compute
+    except Exception:
+        eval_logger.error(
+            f"{name} not found in the evaluate library! Please check https://huggingface.co/evaluate-metric",
+        )
+
+
 def register_aggregation(name):
     def decorate(fn):
         assert name not in AGGREGATION_REGISTRY, f"aggregation named '{name}' conflicts with existing registered aggregation!"
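`get_metric` in the hunk above looks in the local registry first and falls back to an external loader. When the fallback also fails, it logs an error and returns `None` implicitly. The control flow can be sketched without the `evaluate` dependency. In this sketch the `load_external` parameter is a hypothetical stand-in for `hf_evaluate.load(name).compute`, `use_external` plays the role of `hf_evaluate_metric`, and the registry holds a single made-up entry.

```python
import logging
from typing import Callable, Dict, Optional

eval_logger = logging.getLogger("lmms-eval")

# Stand-in registry with one illustrative entry.
METRIC_REGISTRY: Dict[str, Callable] = {"mean": lambda arr: sum(arr) / len(arr)}


def get_metric(
    name: str,
    load_external: Callable[[str], Callable],
    use_external: bool = False,
) -> Optional[Callable]:
    # load_external is a hypothetical injection point replacing the
    # hf_evaluate.load(name).compute call used in the diff.
    if not use_external:
        if name in METRIC_REGISTRY:
            return METRIC_REGISTRY[name]
        eval_logger.warning(f"Could not find registered metric '{name}', trying the external loader...")
    try:
        return load_external(name)
    except Exception:
        eval_logger.error(f"{name} not found by the external loader!")
        # Made explicit here; the diff reaches the same result by falling
        # off the end of the function.
        return None
```

Callers of a function shaped like this should check for `None` before invoking the result, since a failed lookup is logged rather than raised.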
Empty file modified lmms_eval/api/samplers.py
100644 → 100755
Empty file.