
Multilingual LLava bench #56

Merged
merged 2 commits on May 2, 2024

Conversation

gagan3012 (Contributor)

This PR adds the task to run the multilingual LLaVA bench from MBZUAI.
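For context, tasks in the `tasks` folder pair YAML configs with a `utils.py` that defines hooks such as `doc_to_visual` and `doc_to_text` (the same hook names that recur in the commit history below). The sketch below only illustrates that convention and is not this PR's actual code; the dataset column names (`image`, `question`) and the prompt handling are assumptions.

```python
# Hypothetical utils.py sketch for a multilingual LLaVA-bench style task.
# Hook names follow lmms-eval conventions (doc_to_visual, doc_to_text);
# the dataset columns "image" and "question" are assumptions.

def doc_to_visual(doc):
    """Return the list of images the model should see for this example."""
    return [doc["image"].convert("RGB")]


def doc_to_text(doc, model_specific_prompt_kwargs=None):
    """Build the text prompt, optionally wrapped with model-specific pre/post prompts."""
    question = doc["question"].strip()
    if model_specific_prompt_kwargs:
        pre = model_specific_prompt_kwargs.get("pre_prompt", "")
        post = model_specific_prompt_kwargs.get("post_prompt", "")
        question = f"{pre}{question}{post}"
    return question
```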

Luodian (Contributor) commented Apr 13, 2024

Thanks for this PR; it's clean since it only adds files to the tasks folder.

Could you also share a screenshot of a standard model's results (e.g. llava-v1.5-7b) on multilingual-llava-bench?
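For reference, one way to produce those numbers is a standard lmms-eval run along the lines of the sketch below. The task name `multilingual_llava_bench` is a placeholder, since the exact name registered by this PR's YAML files isn't shown here; the remaining flags follow the usual lmms-eval CLI.

```python
# Sketch: launch an lmms-eval run of llava-v1.5-7b on the new benchmark.
# "multilingual_llava_bench" is a placeholder task name; substitute the
# task name actually registered by this PR.
import subprocess

subprocess.run(
    [
        "python", "-m", "lmms_eval",
        "--model", "llava",
        "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",
        "--tasks", "multilingual_llava_bench",
        "--batch_size", "1",
        "--log_samples",
        "--output_path", "./logs/",
    ],
    check=True,
)
```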

Luodian pushed a commit that referenced this pull request Apr 16, 2024
* refactor vizwizvqa task

* Delete vqav2_test and vqav2_val YAML files

* Refactor vqav2_process_results functions

* Add a pack for vqav2

* refactor okvqa

* roll back vizwiz_vqa

* Fix exact_match calculation in ok_vqa_process_results

* Update OKVQA dataset name in readme

* add model_specific_prompt_kwargs

* add model_specific_prompt_kwargs to vizwiz_vqa

* add model_specific_prompt_kwargs for vqav2

* lint

* fix a small bug for eval_logger

* Refactor make_table function to display points as "  -  " if value is None

* Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

* Refactor ok_vqa_aggreate_submissions function

* Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

* Refactor VQA submission file saving

* Update file utils

* Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

* Refactor file path handling and submission generation

* OKVQA path

* vizwizvqa file

* pack cmmmu

* fix a small metric bug for cmmmu

* Add higher_is_better flag to submission metric

* Add CMMMU dataset to README.md

* Add logging and refactor submission file generation in docvqa utils.py

* pack docvqa

* add traceback to print detailed error

* Refactor docvqa_test_aggregate_results to accept additional arguments

* Add metric check in evaluator.py and update test.yaml and val.yaml

* add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

* merge textvqa

* textvqa

* Modify submission file generation for COCO test results

* Update test result storage path

* update coco cap file name

* Update COCO 2017 Caption dataset name

* ferret

* Add Ferret dataset

* Refactor hb_doc_to_text function to include model-specific prompts

* Add IconQA and its subtasks

* Refactor image list creation in doc_to_visual function

* Add process_results function to default template

* Update process_results function in iconqa utils.py

* refactor flickr30k

* change aggregation function

* Fix formatting issues and update logging message

* Fix llava can not handle only text question (no visuals)

* Fix qwen can not handle no image question (no visuals)

* Add fuyu prepare accelerator scripts

* refactor mme

* naming consistency

* aggregation_submissions consistency

* flickr30k naming consistency

* remove submissions for mme

* remove unused submission function

* Refactor infovqa_test.yaml and infovqa_val.yaml

* Refactor code for improved readability and maintainability

* stvqa

* remane sqa

* Update lmms_eval textcaps files and utils.py

* Update default prompt for text captions

* Refactor textcaps_aggregation_result function

* Add generate_submission_file function and update mathvista_aggregate_results signature

* Update nocaps_test.yaml and nocaps_val.yaml

* refractor internal_eval

* Add internal evaluation datasets

* pack multidocvqa

* mmvet

* Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

* Refractor llava wild

* Refractor llava-bench-coco

* Add JSON file generation for gpt evaluation details

* mmmu

* Remove MMBench English and Chinese tasks

* Remove unnecessary return statement in mmbench_aggregate_test_results function

* Fix distributed process group initialization

* Update dataset paths and group names in mmbench test configs

* Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

* Add torch module import

* lint

* Remove IconQA dataset from README.md

* Add Multi-DocVQA and its submodules

* Add new datasets and update task names

* Refactor flickr_aggregation_result function to accept additional arguments

* Add timeout kwargs in Accelerator constructor

* Add encoding to be utf-8 for cmmmu

* Fix llava try and catch, remove torch.distributed.init in main

* Ds prepare script for llava

---------

Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
Luodian added a commit that referenced this pull request Apr 16, 2024
[Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (#59)

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 6b20902
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 21050ba
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit ba0e7f5
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Fix small bugs in list_with_num

* Revise list_with_num model args

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Luodian added a commit that referenced this pull request Apr 16, 2024

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 6b20902
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 21050ba
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit ba0e7f5
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit faf9cf65cf5b1e036ee3a74428e8bb1490e8b2eb
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit e3729eb925b718a44b6eb225ef9b41c7fd2408e0
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 50b697a7ae93b0547484e1cd753722c1d2513349
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 17425b5dce41cf67b96c5875139b57d6c7a423df
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 1bc17d54e79e79d11419ba89e7d8e55bc8cfa21b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit a20bbc30ab576d3e2a587c70af1b7c06575bcd8b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit e2b657694b888ef59b9f896415e7c4c82497e7bf
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 6447d521842b9f83f5119cdcd7714c8f6053ca73
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit 8ac333a2e9ebbe6318d536b6589f767f71fbc092
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9e542ce049f68f49a237be165e3ad9cde7408ac0
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit f90ccf7b94b130e118b4eca321f68b81e7ab5850
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit f651a77707a4c723ebffb07f2a87743bf42ecea7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit a683559c704806b7abde5e4c8355f556f3e65866
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 8e246e2466f3dd14a5e34f720269d7991a6dcf6b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 67f00dc4652d09c662e5202ff7e5fbf7bebcdaf6
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 53b7a845fe8412a652905101ec036c84e77a20c2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit 920b4112c4508e9a8afe824678958f2e78189e4e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 6b20902
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 21050ba
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit ba0e7f5
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 74fff73053b88a90d0f4229a9c748256080fea08
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 0c640a636e3882859a17e30a5c3504850a3d02d6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 7f2b2c3
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 6b20902
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 21050ba
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit ba0e7f5
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit bebff9fad2a60bc0ac52ddc430e5d9e4e0ef6c24
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 5042bb0c2ed4f830dda6bcd14231b1f8763aa95f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit c82042b
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 6b20902
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 21050ba
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit ba0e7f5
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit d78a3d7a53f5285a7eac39ce8f04e9854fdb3e73
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit 8eefaec8489d48613de9395eb8e8150224985e01
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit faf9cf65cf5b1e036ee3a74428e8bb1490e8b2eb
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit e3729eb925b718a44b6eb225ef9b41c7fd2408e0
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 50b697a7ae93b0547484e1cd753722c1d2513349
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 17425b5dce41cf67b96c5875139b57d6c7a423df
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 1bc17d54e79e79d11419ba89e7d8e55bc8cfa21b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit a20bbc30ab576d3e2a587c70af1b7c06575bcd8b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit e2b657694b888ef59b9f896415e7c4c82497e7bf
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 6447d521842b9f83f5119cdcd7714c8f6053ca73
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit 8ac333a2e9ebbe6318d536b6589f767f71fbc092
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9e542ce049f68f49a237be165e3ad9cde7408ac0
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit f90ccf7b94b130e118b4eca321f68b81e7ab5850
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit f651a77707a4c723ebffb07f2a87743bf42ecea7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit a683559c704806b7abde5e4c8355f556f3e65866
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 8e246e2466f3dd14a5e34f720269d7991a6dcf6b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 67f00dc4652d09c662e5202ff7e5fbf7bebcdaf6
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 53b7a845fe8412a652905101ec036c84e77a20c2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit 920b4112c4508e9a8afe824678958f2e78189e4e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 6b20902
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 21050ba
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit ba0e7f5
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
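
Editor's note: the "Add timeout kwargs in Accelerator constructor" entries in the log above refer to raising the distributed process-group timeout so long-running, GPT-judged evaluation jobs do not hit the default limit. A minimal sketch of what such a change typically looks like with HuggingFace `accelerate` (illustrative only; the timeout value and surrounding code are assumptions, not the repository's exact implementation):

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Illustrative timeout; the point is to raise the default ~30-minute
# process-group timeout so slow evaluation/judging steps are not killed mid-run.
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52))
accelerator = Accelerator(kwargs_handlers=[init_kwargs])
```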
Luodian added a commit that referenced this pull request Apr 16, 2024

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 2a45079
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 7bdab7a
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit '90f42f0876a4914c5ac0d213b9dffbfb4797ff62'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit '4afec3303a0a7ed27a8265565343bf2851b9e4c7'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit 'c144b75f0c9145a625b2bbdef5123ed81e343a11'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit d3dfd94
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit 2fbeafc882c80242a10381abc67629d5d8b7071a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit f188052450bed2f3a30ab6f9a6f7eb844a64cb33
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit baef5905505892593fe783beb18a2de20991d6af
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 11b46f3b701b79b361dd5175a263e4d89bd07fb5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 0982de2e7a2310429e51ec7828886fd49953f716
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit f840ed80f4ae467fff62b61844854a3a9e8ec8a5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 80db78f600d07011188983637c94da84b9475fbf
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 676229de870b8d465cef08867cd272a4b696e630
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit d293b96fb3537fea85f10f216d762abf35e05e8d
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 01bbd010590d6b7f105525580209191a1d6d5232
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 66595ebc073ff9431f2400006196c0645be58ea4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 08c2ebad1532fd6c34ac04efb94a268db9862d4f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit aefbd3c6856584135e2dcbe13381db0e0780f063
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit b9aebc3ff3b122d6d4a81bd2f28e86b2c390c505
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c9daa91f2576de69af73c80e263afb085ecd8288
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit b1c4c88b9b36e02e9ed738ff9217d98a5ef2117b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit b35bc4a6c8fd6b4b2a68bb3054878807b8b92281
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 2a45079
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 7bdab7a
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit '90f42f0876a4914c5ac0d213b9dffbfb4797ff62'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit '4afec3303a0a7ed27a8265565343bf2851b9e4c7'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit 'c144b75f0c9145a625b2bbdef5123ed81e343a11'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit d3dfd94
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 556b12620379d79c9ed5ddba0856063b498f917c
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 9509a782c9e9824273cefb1dc9671c92b887697d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 0bff98b
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 2a45079
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 7bdab7a
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit '90f42f0876a4914c5ac0d213b9dffbfb4797ff62'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit '4afec3303a0a7ed27a8265565343bf2851b9e4c7'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit 'c144b75f0c9145a625b2bbdef5123ed81e343a11'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit d3dfd94
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 7c4501a32bbb415ba7e62e93194b37ba9a435cf5
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 5c419f9fa23616a63a0bd584f18e509bb7704b50
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 0010d0a
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 2a45079
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 7bdab7a
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit '90f42f0876a4914c5ac0d213b9dffbfb4797ff62'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit '4afec3303a0a7ed27a8265565343bf2851b9e4c7'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit 'c144b75f0c9145a625b2bbdef5123ed81e343a11'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit d3dfd94
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit b2ca65d1f12d84ae7a37ecc81f760901389a1af0
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit a262ea1720b2c02839d21dad2a7618bc80725f18
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit 2fbeafc882c80242a10381abc67629d5d8b7071a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit f188052450bed2f3a30ab6f9a6f7eb844a64cb33
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit baef5905505892593fe783beb18a2de20991d6af
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 11b46f3b701b79b361dd5175a263e4d89bd07fb5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 0982de2e7a2310429e51ec7828886fd49953f716
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit f840ed80f4ae467fff62b61844854a3a9e8ec8a5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 80db78f600d07011188983637c94da84b9475fbf
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 676229de870b8d465cef08867cd272a4b696e630
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit d293b96fb3537fea85f10f216d762abf35e05e8d
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 01bbd010590d6b7f105525580209191a1d6d5232
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 66595ebc073ff9431f2400006196c0645be58ea4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 08c2ebad1532fd6c34ac04efb94a268db9862d4f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit aefbd3c6856584135e2dcbe13381db0e0780f063
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit b9aebc3ff3b122d6d4a81bd2f28e86b2c390c505
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c9daa91f2576de69af73c80e263afb085ecd8288
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit b1c4c88b9b36e02e9ed738ff9217d98a5ef2117b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit b35bc4a6c8fd6b4b2a68bb3054878807b8b92281
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 2a45079
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 7bdab7a
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit '90f42f0876a4914c5ac0d213b9dffbfb4797ff62'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit '4afec3303a0a7ed27a8265565343bf2851b9e4c7'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit 'c144b75f0c9145a625b2bbdef5123ed81e343a11'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * rename sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refactor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refactor llava wild

    * Refactor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit d3dfd94
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Luodian added a commit that referenced this pull request Apr 16, 2024
* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 74fff73053b88a90d0f4229a9c748256080fea08
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 0c640a636e3882859a17e30a5c3504850a3d02d6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 7f2b2c3
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (#61)

commit bebff9fad2a60bc0ac52ddc430e5d9e4e0ef6c24
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 5042bb0c2ed4f830dda6bcd14231b1f8763aa95f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit c82042b
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (#59)

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
Luodian added a commit that referenced this pull request Apr 16, 2024
* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 556b12620379d79c9ed5ddba0856063b498f917c
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 9509a782c9e9824273cefb1dc9671c92b887697d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 0bff98b
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 2a45079
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 7bdab7a
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit '90f42f0876a4914c5ac0d213b9dffbfb4797ff62'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit '4afec3303a0a7ed27a8265565343bf2851b9e4c7'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit 'c144b75f0c9145a625b2bbdef5123ed81e343a11'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit d3dfd94
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 7c4501a32bbb415ba7e62e93194b37ba9a435cf5
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 5c419f9fa23616a63a0bd584f18e509bb7704b50
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 0010d0a
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 2a45079
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 7bdab7a
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit '90f42f0876a4914c5ac0d213b9dffbfb4797ff62'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit '4afec3303a0a7ed27a8265565343bf2851b9e4c7'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit 'c144b75f0c9145a625b2bbdef5123ed81e343a11'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit d3dfd94
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit b2ca65d1f12d84ae7a37ecc81f760901389a1af0
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit a262ea1720b2c02839d21dad2a7618bc80725f18
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit 2fbeafc882c80242a10381abc67629d5d8b7071a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit f188052450bed2f3a30ab6f9a6f7eb844a64cb33
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit baef5905505892593fe783beb18a2de20991d6af
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 11b46f3b701b79b361dd5175a263e4d89bd07fb5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 0982de2e7a2310429e51ec7828886fd49953f716
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit f840ed80f4ae467fff62b61844854a3a9e8ec8a5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 80db78f600d07011188983637c94da84b9475fbf
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 676229de870b8d465cef08867cd272a4b696e630
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit d293b96fb3537fea85f10f216d762abf35e05e8d
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 01bbd010590d6b7f105525580209191a1d6d5232
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 66595ebc073ff9431f2400006196c0645be58ea4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 08c2ebad1532fd6c34ac04efb94a268db9862d4f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit aefbd3c6856584135e2dcbe13381db0e0780f063
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit b9aebc3ff3b122d6d4a81bd2f28e86b2c390c505
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c9daa91f2576de69af73c80e263afb085ecd8288
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit b1c4c88b9b36e02e9ed738ff9217d98a5ef2117b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit b35bc4a6c8fd6b4b2a68bb3054878807b8b92281
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 2a45079
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 7bdab7a
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit '90f42f0876a4914c5ac0d213b9dffbfb4797ff62'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit '4afec3303a0a7ed27a8265565343bf2851b9e4c7'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit 'c144b75f0c9145a625b2bbdef5123ed81e343a11'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit d3dfd94
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
Luodian merged commit b46239c into EvolvingLMMs-Lab:main May 2, 2024
1 check passed
Luodian added a commit that referenced this pull request Jul 9, 2024
commit 050b2c3
Merge: 74facb4 ef30651
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 18 13:13:38 2024 +0800

    Merge pull request #114 from zjysteven/add-tinyllava

    add tinyllava

commit ef30651
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Mon Jun 17 17:57:02 2024 -0400

    fix typo

commit 9bab677
Merge: dbfb238 74facb4
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Sun Jun 16 10:56:05 2024 -0400

    Merge branch 'EvolvingLMMs-Lab:main' into add-tinyllava

commit 74facb4
Merge: 8ba192f d5df72d
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 16 17:59:19 2024 +0800

    Merge pull request #118 from teowu/main

    Fix the potential risk by PR #117

commit d5df72d
Merge: 5bf59ed 8ba192f
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sun Jun 16 15:32:13 2024 +0800

    Merge branch 'EvolvingLMMs-Lab:main' into main

commit 5bf59ed
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:27:28 2024 +0000

    fix #117, allow auto download with tar format videos

commit 98b3955
Merge: a056f11 be9dada
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:25:07 2024 +0000

    Merge branch 'main' of https://github.com/teowu/lmms-eval into main

commit a056f11
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:23:54 2024 +0000

    fix #117, allow auto download with tar format videos

commit 8ba192f
Merge: 7cc2890 be9dada
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 17:30:59 2024 +0800

    Merge pull request #117 from teowu/main

    LongVideoBench for LMMs-Eval

commit be9dada
Merge: 62ea8ce 7cc2890
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sat Jun 15 16:39:20 2024 +0800

    Merge pull request #1 from EvolvingLMMs-Lab/main

    Merge pull request #113 from teowu/main

commit 62ea8ce
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sat Jun 15 08:30:11 2024 +0000

    LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

commit 7cc2890
Merge: 4bc7224 ea14cd4
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 14:10:22 2024 +0800

    Merge pull request #113 from teowu/main

    Q-Bench, Q-Bench2, A-Bench

commit dbfb238
Author: Jingyang <jingyang.zhang@duke.edu>
Date:   Fri Jun 14 16:20:42 2024 -0400

    add tinyllava

commit ea14cd4
Author: teowu <realtimothyhwu@gmail.com>
Date:   Fri Jun 14 15:01:52 2024 +0000

    Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image

commit 4bc7224
Merge: 2797987 bf14cb8
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Jun 14 02:14:43 2024 +0800

    Merge pull request #111 from XinrunDu/main

    add II-Bench

commit bf14cb8
Author: XinrunDu <duxinrun2000@gmail.com>
Date:   Thu Jun 13 09:37:02 2024 +0000

    fix dataset_path

commit 6248113
Author: XinrunDu <duxinrun2000@gmail.com>
Date:   Thu Jun 13 09:32:06 2024 +0000

    add II-Bench

commit 2797987
Merge: 63d82f1 66d4bb2
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 11:14:47 2024 +0800

    Merge pull request #109 from EvolvingLMMs-Lab/pufanyi/update_version

    [Small Update] Update the version of LMMs-Eval

commit 66d4bb2
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jun 13 11:13:00 2024 +0800

    update version

commit 63d82f1
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 11:04:32 2024 +0800

    Update README.md

commit 44a3379
Merge: 5ed0035 0ce46d0
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 04:00:12 2024 +0800

    Merge pull request #105 from tianyu-z/main

    Include VCR

commit 0ce46d0
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:56:34 2024 -0400

    update README.md

commit 46a88d8
Merge: 47b13b9 5ed0035
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:50:26 2024 -0400

    merged readme.md

commit 47b13b9
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:30:52 2024 -0400

    update aggregation function for vcr_wiki

commit 5ed0035
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:21:42 2024 +0800

    Update README.md

commit ed88068
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:13:59 2024 +0800

    Update README.md

commit fea3806
Merge: d99a24a 05dc8e8
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:11:49 2024 +0800

    Merge pull request #108 from EvolvingLMMs-Lab/internal_main_dev

    [Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval

commit 05dc8e8
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:56:04 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit cbeee20
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:50:30 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit f00d549
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:46:33 2024 +0000

    Update image alignment in README.md

commit 3415633
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:43:16 2024 +0000

    Update llava conv_template in lmms_eval/models/llava.py

commit 50575a9
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:39:03 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit c9b2252
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:33:48 2024 +0000

    Bump version to 0.2.0.dev0

commit 465bd42
Merge: e43bd84 d99a24a
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:04:25 2024 +0000

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval into internal_main_dev

commit e43bd84
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 14:54:06 2024 +0000

    chore: Remove unnecessary files and code related to live_bench and sft_eval tasks

commit d99a24a
Merge: 374590b a66003b
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jun 12 19:45:57 2024 +0800

    Merge pull request #107 from AtsuMiyai/new_task/upd_update

    update gpt-3.5-turbo version

commit a66003b
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed Jun 12 17:05:17 2024 +0900

    update gpt-3.5-turbo version

commit ee91f27
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed Jun 12 16:50:53 2024 +0900

    update gpt-3.5-turbo version

commit 326b969
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 20:07:40 2024 -0400

    include std and confidence interval

commit cd050d4
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 18:49:47 2024 -0400

    update vcr_wiki tasks in README.md

commit 205721e
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 18:43:15 2024 -0400

    update vcr_wiki tasks

commit db8e718
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 16:13:58 2024 -0400

    include the try-except logic for spacy

commit 427dabb
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 15:51:05 2024 -0400

    add crossed_text to vcr_wiki output

commit 043b483
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 15:47:00 2024 -0400

    switch logic

commit e1f04db
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 02:38:21 2024 -0400

    modify the form of VCR

commit 96e8d98
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 00:10:30 2024 -0400

    init include vcr

commit 374590b
Merge: 504685e cb3b9ce
Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
Date:   Fri Jun 7 20:25:48 2024 +0800

    Merge pull request #101 from Gumpest/main

    Update conbench in README

commit 504685e
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 6 15:42:15 2024 +0800

    Update README.md

commit cb3b9ce
Merge: c9793b3 67b64ea
Author: Yuan Zhang <56063339+Gumpest@users.noreply.github.com>
Date:   Thu Jun 6 11:22:24 2024 +0800

    Merge branch 'EvolvingLMMs-Lab:main' into main

commit c9793b3
Author: Yuan Zhang <gump_well_done@163.com>
Date:   Thu Jun 6 11:21:05 2024 +0800

    update README

commit 67b64ea
Merge: 8ee7848 5fd6845
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jun 5 23:12:58 2024 +0800

    Merge pull request #100 from Gumpest/main

    add Conbench

commit 5fd6845
Author: Yuan Zhang <gump_well_done@163.com>
Date:   Wed Jun 5 21:52:31 2024 +0800

    add conbench

commit 8ee7848
Merge: 747e197 6fefaf7
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 4 17:09:33 2024 +0800

    Merge pull request #95 from AtsuMiyai/new_task/upd

    add MM-UPD

commit 747e197
Merge: 4854a34 0584307
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 4 17:09:04 2024 +0800

    Merge pull request #97 from CaraJ7/update

    Add MathVerse in README.md

commit 6fefaf7
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Tue Jun 4 17:36:39 2024 +0900

    update utils.py for leaderboard submission

commit 5f4fe36
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Sun Jun 2 23:28:27 2024 +0900

    slightly change query_prompt for the reproduction

commit 0584307
Author: CaraJ7 <1350074492@qq.com>
Date:   Sun Jun 2 17:05:28 2024 +0800

    Add MathVerse in README.md

commit 0581ab3
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Fri May 31 16:09:45 2024 +0900

    merge model_specific_prompt_kwargs and dataset_name into each task yaml

commit 4854a34
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Sat May 4 19:23:39 2024 +0800

    Group MMMU images into one image (#83)

    * update

    * update font

    * Add matplotlib.font_manager import in utils.py

    * Refactor font handling in add_order_label function in utils.py

    * group mmmu

    ---------

    Co-authored-by: Li Bo <drluodian@gmail.com>

commit d224794
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 15:15:59 2024 +0900

    add upd

commit 453e793
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 15:03:30 2024 +0900

    add upd

commit 909edd6
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:52:21 2024 +0900

    add upd

commit 7c1ac97
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:50:32 2024 +0900

    add upd

commit 811301c
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:46:58 2024 +0900

    add upd

commit 71401ba
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:41:21 2024 +0900

    add upd

commit 24dc435
Author: Bo Li <drluodian@gmail.com>
Date:   Mon May 27 10:17:32 2024 +0000

    fix compatibility issue of older version llava

commit 616edf4
Author: Bo Li <drluodian@gmail.com>
Date:   Mon May 27 09:32:26 2024 +0000

    [Fix] import issues of multilingual llava and olympiadbench

commit 4c5a99e
Merge: 45c05b2 b05c3e2
Author: Li Bo <drluodian@gmail.com>
Date:   Mon May 27 14:19:53 2024 +0800

    Merge pull request #87 from vfragoso/vifragos/phi3v

    Adding microsoft/Phi-3-vision-128k-instruct model.

commit b05c3e2
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 16:36:37 2024 +0000

    Adding documentation of Phi3v class.

commit c200897
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 16:25:02 2024 +0000

    Adding prompt arguments for Phi3v on MathVista-TestMini

commit 7f9fb6b
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 13:24:16 2024 +0000

    Adding Phi3v model.

commit 45c05b2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:47:36 2024 +0000

    Set printing info for llava_hf to debug level

commit 53f013e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:41:39 2024 +0000

    Fix pope random name in pope full

commit 22520a9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:41:14 2024 +0000

    Add separated pope tasks by category

commit d1eefb1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 9 08:36:02 2024 +0000

    Update gitignore

commit b2b4dbd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 20 07:45:11 2024 +0000

    Comment out Spice in caption task so that don't need to download stanford nlp model

commit 662f05c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 20 03:13:13 2024 +0000

    Comment out parse result in xcomposer

commit 0932932
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 03:55:39 2024 +0000

    Fix instructblip qformer size mismatch and multi-images problem

commit 557a6a3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 03:11:41 2024 +0000

    Remove redundant code in fuyu

commit 6aeb550
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 01:45:24 2024 +0000

    Fix idefics2 llava in the wild bugs

commit aea80e6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Wed May 15 11:07:35 2024 +0000

    Better task list_with_num

commit 3c12a08
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:35:52 2024 +0800

    Update LICENSE

commit 82317a6
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:29:09 2024 +0800

    Update LICENSE

commit a8bba1c
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:28:03 2024 +0800

    Create LICENSE

commit caa5893
Merge: c094448 423b006
Author: Li Bo <drluodian@gmail.com>
Date:   Mon May 13 11:45:26 2024 +0800

    Merge pull request #73 from EvolvingLMMs-Lab/kc/qwen_vl_api

    [Feat] Add qwen vl api

commit c094448
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat May 11 06:11:19 2024 +0000

    Fix llava_hf image tokens number issue

commit 64f07e4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 9 02:04:10 2024 +0000

    Fix endless warning for llava_hf generation

commit 8aaa828
Author: Bo Li <drluodian@gmail.com>
Date:   Thu May 2 06:13:56 2024 +0000

    Add model_name parameter to Llava constructor

commit 7847dc4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue May 7 03:15:59 2024 +0000

    Parse result for llava_hf 1.6

commit 3e56b4f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue May 7 03:09:56 2024 +0000

    Fix llava_hf generation for 1.6

commit fa3ff92
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 6 08:32:57 2024 +0000

    Fix llava conv template for llama3

commit 423b006
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun May 5 07:54:52 2024 +0000

    Add qwen vl api

commit b7fd7a9
Merge: 986139a c5a130b
Author: Li Bo <drluodian@gmail.com>
Date:   Sun May 5 13:19:48 2024 +0800

    Merge pull request #59 from EvolvingLMMs-Lab/add_idefics2

    add idefics2

commit 986139a
Merge: b46239c 8d3526c
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:18:18 2024 +0800

    Merge pull request #36 from cocoshe/main

    [Fix] repr llava doc

commit b46239c
Merge: bc69a74 373265f
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:17:34 2024 +0800

    Merge pull request #56 from gagan3012/main

    Multilingual LLava bench

commit bc69a74
Merge: eef3aeb 626e8a9
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:12:14 2024 +0800

    Merge pull request #70 from hunterheiden/hsh/new_task/WebSRC

    Bugfix: WebSRC should be token-level F1 NOT character-level

commit 626e8a9
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu May 2 09:31:03 2024 -0400

    Bugfix: WebSRC should be token-level F1 NOT character-level

commit eef3aeb
Merge: c4e9dd9 9bca441
Author: Li Bo <drluodian@gmail.com>
Date:   Thu May 2 14:38:17 2024 +0800

    Merge pull request #69 from hunterheiden/hsh/new_task/WebSRC

    [New Task] WebSRC (multimodal Q&A on web screenshots)

commit 9bca441
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 11:07:29 2024 -0400

    Add code to enable compilation of submission for WebSRC test split

commit 7687495
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:47:32 2024 -0400

    Draft and validate websrc eval on dev split

commit 4eebd3e
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:46:54 2024 -0400

    Update main README with new task names

commit 35fe80b
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:46:20 2024 -0400

    Draft README for WebSRC

commit 955bd06
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Tue Apr 30 10:16:21 2024 -0400

    Init webSRC

commit c4e9dd9
Merge: d8a3a99 319afcc
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Apr 26 14:37:22 2024 +0800

    Merge pull request #63 from hunterheiden/hsh/new_task/screenspot

    New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens

commit 319afcc
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu Apr 25 11:44:34 2024 -0400

    slight update

commit 2f3811c
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu Apr 25 11:41:04 2024 -0400

    Add README file specific to ScreenSpot

commit 28962cb
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed Apr 24 11:52:33 2024 -0400

    Update README to reflect new tasks

commit e457cfb
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Tue Apr 23 18:33:16 2024 -0400

    Create ScreenSpot on clean branch

commit d8a3a99
Merge: 3dcd015 ed17129
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Apr 23 10:34:03 2024 +0800

    Merge pull request #61 from tupini07/patch-1

    Fix typo in Qwen-VL that was causing "reference before assignment"

commit ed17129
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:56:41 2024 -0600

    refactor query construction for clarity

commit cd87420
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:54:29 2024 -0600

    convert contexts to list if necessary and remove unnecessary construction of `questions`

commit 8557367
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:47:33 2024 -0600

    Fix typo in qwen_vl that was causing "reference before assignment"

commit 3dcd015
Merge: 95df9fe 743673a
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Apr 20 22:03:16 2024 +0800

    Merge pull request #60 from CaraJ7/main

    Add MathVerse

commit 743673a
Merge: c1a5472 95df9fe
Author: CaraJ7 <1350074492@qq.com>
Date:   Sat Apr 20 21:49:02 2024 +0800

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

commit c1a5472
Author: CaraJ7 <1350074492@qq.com>
Date:   Sat Apr 20 21:45:34 2024 +0800

    Add MathVerse

commit 373265f
Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
Date:   Fri Apr 12 17:21:39 2024 -0700

    Add files via upload

commit d853051
Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
Date:   Fri Apr 12 17:19:49 2024 -0700

    Create README.md

commit 8d3526c
Author: cocoshe <1228759711@qq.com>
Date:   Thu Mar 28 13:38:36 2024 +0800

    fix doc
Luodian added a commit that referenced this pull request Jul 9, 2024
commit 8f9d620
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 14:02:25 2024 +0800

    Update pyproject.toml

commit 6341b7c
Merge: fce85f1 903b042
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 14:02:02 2024 +0800

    Merge pull request #125 from EvolvingLMMs-Lab/dev/interleave

    [Model] aligned llava-interleave model results on video tasks

commit 903b042
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Jun 22 12:07:13 2024 +0000

    Remove unnecessary lines for video llava

commit d78ec86
Merge: ebe7217 fce85f1
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 22 13:57:31 2024 +0800

    Merge branch 'main' into dev/interleave

commit ebe7217
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Jun 22 02:57:08 2024 +0000

    Delete unnecessary lines

commit 120c474
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:38:41 2024 +0000

    Revise model registry for llava_hf and longva

commit 7d6201f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:38:24 2024 +0000

    Add longva

commit 12f4806
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:35:39 2024 +0000

    Remove unnecessary lines since use batched visuals now in llava

commit 12cea76
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 18:15:32 2024 +0000

    chore: Add loguru for logging in lmms_eval package

commit 8ef2474
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 12:11:03 2024 +0000

    chore: Remove unused models from lmms_eval package

commit af38885
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 12:07:09 2024 +0000

    chore: Handle ImportError when importing models

    Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.
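For reference, a minimal sketch of the try-except import guard that commit message describes; the `AVAILABLE_MODELS` registry, module paths, and class names below are illustrative assumptions, not the actual lmms_eval source.

```python
# Minimal sketch (assumed names, not the actual lmms_eval source) of guarding
# model imports with try-except so one broken dependency does not break the rest.
import importlib
import logging

logger = logging.getLogger(__name__)

# Hypothetical registry mapping model name -> (module path, class name).
AVAILABLE_MODELS = {
    "llava": ("lmms_eval.models.llava", "Llava"),
    "qwen_vl": ("lmms_eval.models.qwen_vl", "QwenVL"),
}

for name, (module_path, class_name) in AVAILABLE_MODELS.items():
    try:
        module = importlib.import_module(module_path)
        globals()[class_name] = getattr(module, class_name)
    except ImportError as exc:
        # Log the failed import instead of raising, so other models remain usable.
        logger.error("Failed to import %s from %s: %s", class_name, module_path, exc)
```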

commit fce85f1
Merge: dbe6329 d94f83c
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 20:02:12 2024 +0800

    Merge pull request #120 from EvolvingLMMs-Lab/pufanyi/hf_dataset_docs

    Add docs for datasets upload to HF

commit dbe6329
Author: choiszt <ls2001927@sohu.com>
Date:   Thu Jun 20 15:14:21 2024 +0800

    update ablation for videomme datasets

commit d94f83c
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 13:30:59 2024 +0800

    Update README.md

commit cab8159
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 13:30:29 2024 +0800

    Update README.md

commit 4587665
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu Jun 20 03:55:30 2024 +0000

    Add llava_hf back to registry

commit 3463651
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu Jun 20 03:54:33 2024 +0000

    Remove handling non-visual loop in llava

commit cb0d3f4
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jun 20 02:11:18 2024 +0800

    update readme

commit 813877b
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:37:52 2024 +0800

    to sh script

commit a14684b
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:37:04 2024 +0800

    lint

commit d0f8851
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:36:48 2024 +0800

    small fix

commit 63748e9
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:36:43 2024 +0800

    small fix

commit 7f1159a
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:35:05 2024 +0800

    update preparation

commit 19f9bd6
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:23:24 2024 +0800

    docs

commit ce6f889
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:04:16 2024 +0800

    tutorial

commit f513c52
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 19 06:51:19 2024 +0000

    chore: Update dependencies to fix potential risks and improve compatibility

commit efb5295
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Wed Jun 19 10:25:58 2024 +0800

    Release llava-wilder

commit 742651f
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 07:44:26 2024 +0800

    feat: Add support for auto downloading tar format videos

commit 511b625
Merge: 22a4958 050b2c3
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jun 18 17:01:03 2024 +0000

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

commit 050b2c3
Merge: 74facb4 ef30651
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 18 13:13:38 2024 +0800

    Merge pull request #114 from zjysteven/add-tinyllava

    add tinyllava

commit ef30651
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Mon Jun 17 17:57:02 2024 -0400

    fix typo

commit 9bab677
Merge: dbfb238 74facb4
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Sun Jun 16 10:56:05 2024 -0400

    Merge branch 'EvolvingLMMs-Lab:main' into add-tinyllava

commit 74facb4
Merge: 8ba192f d5df72d
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 16 17:59:19 2024 +0800

    Merge pull request #118 from teowu/main

    Fix the potential risk by PR #117

commit d5df72d
Merge: 5bf59ed 8ba192f
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sun Jun 16 15:32:13 2024 +0800

    Merge branch 'EvolvingLMMs-Lab:main' into main

commit 5bf59ed
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:27:28 2024 +0000

    fix #117, allow auto download with tar format videos

commit 98b3955
Merge: a056f11 be9dada
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:25:07 2024 +0000

    Merge branch 'main' of https://github.com/teowu/lmms-eval into main

commit a056f11
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:23:54 2024 +0000

    fix #117, allow auto download with tar format videos

commit 8ba192f
Merge: 7cc2890 be9dada
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 17:30:59 2024 +0800

    Merge pull request #117 from teowu/main

    LongVideoBench for LMMs-Eval

commit be9dada
Merge: 62ea8ce 7cc2890
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sat Jun 15 16:39:20 2024 +0800

    Merge pull request #1 from EvolvingLMMs-Lab/main

    Merge pull request #113 from teowu/main

commit 62ea8ce
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sat Jun 15 08:30:11 2024 +0000

    LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

commit 7cc2890
Merge: 4bc7224 ea14cd4
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 14:10:22 2024 +0800

    Merge pull request #113 from teowu/main

    Q-Bench, Q-Bench2, A-Bench

commit dbfb238
Author: Jingyang <jingyang.zhang@duke.edu>
Date:   Fri Jun 14 16:20:42 2024 -0400

    add tinyllava

commit ea14cd4
Author: teowu <realtimothyhwu@gmail.com>
Date:   Fri Jun 14 15:01:52 2024 +0000

    Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image

commit 4bc7224
Merge: 2797987 bf14cb8
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Jun 14 02:14:43 2024 +0800

    Merge pull request #111 from XinrunDu/main

    add II-Bench

commit bf14cb8
Author: XinrunDu <duxinrun2000@gmail.com>
Date:   Thu Jun 13 09:37:02 2024 +0000

    fix dataset_path

commit 6248113
Author: XinrunDu <duxinrun2000@gmail.com>
Date:   Thu Jun 13 09:32:06 2024 +0000

    add II-Bench

commit 2797987
Merge: 63d82f1 66d4bb2
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 11:14:47 2024 +0800

    Merge pull request #109 from EvolvingLMMs-Lab/pufanyi/update_version

    [Small Update] Update the version of LMMs-Eval

commit 66d4bb2
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jun 13 11:13:00 2024 +0800

    update version

commit 63d82f1
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 11:04:32 2024 +0800

    Update README.md

commit 44a3379
Merge: 5ed0035 0ce46d0
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 04:00:12 2024 +0800

    Merge pull request #105 from tianyu-z/main

    Include VCR

commit 0ce46d0
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:56:34 2024 -0400

    update README.md

commit 46a88d8
Merge: 47b13b9 5ed0035
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:50:26 2024 -0400

    merged readme.md

commit 47b13b9
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:30:52 2024 -0400

    update aggregation function for vcr_wiki

commit 5ed0035
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:21:42 2024 +0800

    Update README.md

commit ed88068
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:13:59 2024 +0800

    Update README.md

commit fea3806
Merge: d99a24a 05dc8e8
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:11:49 2024 +0800

    Merge pull request #108 from EvolvingLMMs-Lab/internal_main_dev

    [Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval

commit 05dc8e8
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:56:04 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit cbeee20
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:50:30 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit f00d549
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:46:33 2024 +0000

    Update image alignment in README.md

commit 3415633
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:43:16 2024 +0000

    Update llava conv_template in lmms_eval/models/llava.py

commit 50575a9
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:39:03 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit c9b2252
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:33:48 2024 +0000

    Bump version to 0.2.0.dev0

commit 465bd42
Merge: e43bd84 d99a24a
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:04:25 2024 +0000

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval into internal_main_dev

commit e43bd84
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 14:54:06 2024 +0000

    chore: Remove unnecessary files and code related to live_bench and sft_eval tasks

commit d99a24a
Merge: 374590b a66003b
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jun 12 19:45:57 2024 +0800

    Merge pull request #107 from AtsuMiyai/new_task/upd_update

    update gpt-3.5-turbo version

commit a66003b
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed Jun 12 17:05:17 2024 +0900

    update gpt-3.5-turbo version

commit ee91f27
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed Jun 12 16:50:53 2024 +0900

    update gpt-3.5-turbo version

commit 326b969
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 20:07:40 2024 -0400

    include std and confidence interval

commit cd050d4
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 18:49:47 2024 -0400

    update vcr_wiki tasks in README.md

commit 205721e
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 18:43:15 2024 -0400

    update vcr_wiki tasks

commit db8e718
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 16:13:58 2024 -0400

    include the try-except logic for spacy

commit 427dabb
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 15:51:05 2024 -0400

    add crossed_text to vcr_wiki output

commit 043b483
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 15:47:00 2024 -0400

    switch logic

commit e1f04db
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 02:38:21 2024 -0400

    modify the form of VCR

commit 96e8d98
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 00:10:30 2024 -0400

    init include vcr

commit 374590b
Merge: 504685e cb3b9ce
Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
Date:   Fri Jun 7 20:25:48 2024 +0800

    Merge pull request #101 from Gumpest/main

    Update conbench in README

commit 504685e
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 6 15:42:15 2024 +0800

    Update README.md

commit cb3b9ce
Merge: c9793b3 67b64ea
Author: Yuan Zhang <56063339+Gumpest@users.noreply.github.com>
Date:   Thu Jun 6 11:22:24 2024 +0800

    Merge branch 'EvolvingLMMs-Lab:main' into main

commit c9793b3
Author: Yuan Zhang <gump_well_done@163.com>
Date:   Thu Jun 6 11:21:05 2024 +0800

    update README

commit 67b64ea
Merge: 8ee7848 5fd6845
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jun 5 23:12:58 2024 +0800

    Merge pull request #100 from Gumpest/main

    add Conbench

commit 5fd6845
Author: Yuan Zhang <gump_well_done@163.com>
Date:   Wed Jun 5 21:52:31 2024 +0800

    add conbench

commit 8ee7848
Merge: 747e197 6fefaf7
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 4 17:09:33 2024 +0800

    Merge pull request #95 from AtsuMiyai/new_task/upd

    add MM-UPD

commit 747e197
Merge: 4854a34 0584307
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 4 17:09:04 2024 +0800

    Merge pull request #97 from CaraJ7/update

    Add MathVerse in README.md

commit 6fefaf7
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Tue Jun 4 17:36:39 2024 +0900

    update utils.py for leaderboard submission

commit 5f4fe36
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Sun Jun 2 23:28:27 2024 +0900

    slightly change query_prompt for the reproduction

commit 0584307
Author: CaraJ7 <1350074492@qq.com>
Date:   Sun Jun 2 17:05:28 2024 +0800

    Add MathVerse in README.md

commit 0581ab3
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Fri May 31 16:09:45 2024 +0900

    merge model_specific_prompt_kwargs and dataset_name into each task yaml

commit 4854a34
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Sat May 4 19:23:39 2024 +0800

    Group MMMU images into one image (#83)

    * update

    * update font

    * Add matplotlib.font_manager import in utils.py

    * Refactor font handling in add_order_label function in utils.py

    * group mmmu

    ---------

    Co-authored-by: Li Bo <drluodian@gmail.com>

commit d224794
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 15:15:59 2024 +0900

    add upd

commit 453e793
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 15:03:30 2024 +0900

    add upd

commit 909edd6
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:52:21 2024 +0900

    add upd

commit 7c1ac97
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:50:32 2024 +0900

    add upd

commit 811301c
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:46:58 2024 +0900

    add upd

commit 71401ba
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:41:21 2024 +0900

    add upd

commit 24dc435
Author: Bo Li <drluodian@gmail.com>
Date:   Mon May 27 10:17:32 2024 +0000

    fix compatibility issue of older version llava

commit 616edf4
Author: Bo Li <drluodian@gmail.com>
Date:   Mon May 27 09:32:26 2024 +0000

    [Fix] import issues of multilingual llava and olympiadbench

commit 4c5a99e
Merge: 45c05b2 b05c3e2
Author: Li Bo <drluodian@gmail.com>
Date:   Mon May 27 14:19:53 2024 +0800

    Merge pull request #87 from vfragoso/vifragos/phi3v

    Adding microsoft/Phi-3-vision-128k-instruct model.

commit b05c3e2
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 16:36:37 2024 +0000

    Adding documentation of Phi3v class.

commit c200897
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 16:25:02 2024 +0000

    Adding prompt arguments for Phi3v on MathVista-TestMini

commit 7f9fb6b
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 13:24:16 2024 +0000

    Adding Phi3v model.

commit 45c05b2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:47:36 2024 +0000

    Set printing info for llava_hf to debug level

commit 53f013e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:41:39 2024 +0000

    Fix pope random name in pope full

commit 22520a9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:41:14 2024 +0000

    Add separated pope tasks by category

commit d1eefb1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 9 08:36:02 2024 +0000

    Update gitignore

commit b2b4dbd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 20 07:45:11 2024 +0000

    Comment out Spice in caption task so that don't need to download stanford nlp model

commit 662f05c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 20 03:13:13 2024 +0000

    Comment out parse result in xcomposer

commit 0932932
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 03:55:39 2024 +0000

    Fix instructblip qformer size mismatch and multi-images problem

commit 557a6a3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 03:11:41 2024 +0000

    Remove redundant code in fuyu

commit 6aeb550
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 01:45:24 2024 +0000

    Fix idefics2 llava in the wild bugs

commit aea80e6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Wed May 15 11:07:35 2024 +0000

    Better task list_with_num

commit 3c12a08
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:35:52 2024 +0800

    Update LICENSE

commit 82317a6
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:29:09 2024 +0800

    Update LICENSE

commit a8bba1c
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:28:03 2024 +0800

    Create LICENSE

commit caa5893
Merge: c094448 423b006
Author: Li Bo <drluodian@gmail.com>
Date:   Mon May 13 11:45:26 2024 +0800

    Merge pull request #73 from EvolvingLMMs-Lab/kc/qwen_vl_api

    [Feat] Add qwen vl api

commit c094448
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat May 11 06:11:19 2024 +0000

    Fix llava_hf image tokens number issue

commit 64f07e4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 9 02:04:10 2024 +0000

    Fix endless warning for llava_hf generation

commit 8aaa828
Author: Bo Li <drluodian@gmail.com>
Date:   Thu May 2 06:13:56 2024 +0000

    Add model_name parameter to Llava constructor

commit 7847dc4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue May 7 03:15:59 2024 +0000

    Parse result for llava_hf 1.6

commit 3e56b4f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue May 7 03:09:56 2024 +0000

    Fix llava_hf generation for 1.6

commit fa3ff92
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 6 08:32:57 2024 +0000

    Fix llava conv template for llama3

commit 423b006
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun May 5 07:54:52 2024 +0000

    Add qwen vl api

commit b7fd7a9
Merge: 986139a c5a130b
Author: Li Bo <drluodian@gmail.com>
Date:   Sun May 5 13:19:48 2024 +0800

    Merge pull request #59 from EvolvingLMMs-Lab/add_idefics2

    add idefics2

commit 986139a
Merge: b46239c 8d3526c
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:18:18 2024 +0800

    Merge pull request #36 from cocoshe/main

    [Fix] repr llava doc

commit b46239c
Merge: bc69a74 373265f
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:17:34 2024 +0800

    Merge pull request #56 from gagan3012/main

    Multilingual LLava bench

commit bc69a74
Merge: eef3aeb 626e8a9
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:12:14 2024 +0800

    Merge pull request #70 from hunterheiden/hsh/new_task/WebSRC

    Bugfix: WebSRC should be token-level F1 NOT character-level

commit 626e8a9
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu May 2 09:31:03 2024 -0400

    Bugfix: WebSRC should be token-level F1 NOT character-level

commit eef3aeb
Merge: c4e9dd9 9bca441
Author: Li Bo <drluodian@gmail.com>
Date:   Thu May 2 14:38:17 2024 +0800

    Merge pull request #69 from hunterheiden/hsh/new_task/WebSRC

    [New Task] WebSRC (multimodal Q&A on web screenshots)

commit 9bca441
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 11:07:29 2024 -0400

    Add code to enable compilation of submission for WebSRC test split

commit 7687495
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:47:32 2024 -0400

    Draft and validate websrc eval on dev split

commit 4eebd3e
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:46:54 2024 -0400

    Update main README with new task names

commit 35fe80b
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:46:20 2024 -0400

    Draft README for WebSRC

commit 955bd06
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Tue Apr 30 10:16:21 2024 -0400

    Init webSRC

commit c4e9dd9
Merge: d8a3a99 319afcc
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Apr 26 14:37:22 2024 +0800

    Merge pull request #63 from hunterheiden/hsh/new_task/screenspot

    New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens

commit 319afcc
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu Apr 25 11:44:34 2024 -0400

    slight update

commit 2f3811c
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu Apr 25 11:41:04 2024 -0400

    Add README file specific to ScreenSpot

commit 28962cb
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed Apr 24 11:52:33 2024 -0400

    Update README to reflect new tasks

commit e457cfb
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Tue Apr 23 18:33:16 2024 -0400

    Create ScreenSpot on clean branch

commit d8a3a99
Merge: 3dcd015 ed17129
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Apr 23 10:34:03 2024 +0800

    Merge pull request #61 from tupini07/patch-1

    Fix typo in Qwen-VL that was causing "reference before assignment"

commit ed17129
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:56:41 2024 -0600

    refactor query construction for clarity

commit cd87420
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:54:29 2024 -0600

    convert contexts to list if necessary and remove unnecessary construction of `questions`

commit 8557367
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:47:33 2024 -0600

    Fix typo in qwen_vl that was causing "reference before assignment"

commit 3dcd015
Merge: 95df9fe 743673a
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Apr 20 22:03:16 2024 +0800

    Merge pull request #60 from CaraJ7/main

    Add MathVerse

commit 743673a
Merge: c1a5472 95df9fe
Author: CaraJ7 <1350074492@qq.com>
Date:   Sat Apr 20 21:49:02 2024 +0800

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

commit c1a5472
Author: CaraJ7 <1350074492@qq.com>
Date:   Sat Apr 20 21:45:34 2024 +0800

    Add MathVerse

commit 373265f
Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
Date:   Fri Apr 12 17:21:39 2024 -0700

    Add files via upload

commit d853051
Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
Date:   Fri Apr 12 17:19:49 2024 -0700

    Create README.md

commit 22a4958
Author: Bo Li <bo.li01@bytedance.com>
Date:   Thu Apr 4 17:12:43 2024 +0000

    [WIP] adding mmbench dev evaluation (#75)

    * WIP

    * Update GPT evaluation model name and sys prompt

    * 🛠️ Scale accuracy to percentage

    The accuracy value is now multiplied by 100 in the aggregation function to represent it as a percentage. Regarding the evaluation process, `math` module importation and refactoring reduce progress log verbosity by logging every 100 evaluations instead of 10. It prevents potential logging overflow. Handling of NaN values is added to ensure 'default_value' is set in case of missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintenance.

    Issue refs: #1427, #1533

    * Update GPT evaluation model name and API configuration

    * Refactor MMBench_Evaluator class to handle missing columns

    * Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations

    * Refactor MMBench-CN and MMBench-EN evaluation functions

    * 🔄 Refactor result processing and logging logic

    - Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
    - Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
    - In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
    - Commented-out code and verbose logging during evaluation, which may have interfered with performance, has been removed for a more efficient and less intrusive logging experience.

    This cleanup reduces redundancy in the codebase and improves evaluation performance.

    Refs #2045

    ---------

    Co-authored-by: Bo Li <bo.li01@bytedance.com>
    (cherry picked from commit a19278c)

commit 8d3526c
Author: cocoshe <1228759711@qq.com>
Date:   Thu Mar 28 13:38:36 2024 +0800

    fix doc

Luodian added a commit that referenced this pull request Sep 1, 2024
* chore: Update gpt_eval_model_name to "gpt-3.5-turbo" in mathvista.yaml

* Squashed commit of the following:

commit 994c9f97a2f8db3e9b7d7933d1e1680acde5b70b
Author: Yan Shu <570533048@qq.com>
Date:   Mon Jul 8 17:21:23 2024 +0800

    Add files via upload

* Squashed commit of the following:

commit e31cd78
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jul 10 12:08:08 2024 +1000

    chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

commit 1d8c980
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue Jul 9 02:08:52 2024 +0000

    Rename xcomposer 4KHD

commit 6da76f3
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:55:56 2024 +1000

    Upgrade lmms-eval to version 0.2.1

commit cd18585
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:52:23 2024 +1000

    Upgrade lmms-eval to support more models and evaluation tasks

commit 672d7e5
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:43:41 2024 +1000

    feat: Add tie_weights parameter to Llava model initialization

commit 2037a86
Merge: e6844db a5c1869
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:37:12 2024 +1000

    Fix gen kwargs image aspect ratio in internvl2

commit a5c1869
Merge: 2ebec77 557083a
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 9 09:15:56 2024 +0800

    Merge pull request #137 from shuyansy/main

    add MLVU task

commit 557083a
Author: Yan Shu <570533048@qq.com>
Date:   Mon Jul 8 16:56:50 2024 +0800

    Add files via upload

commit 2ebec77
Merge: 211bfed b23d349
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 8 11:53:06 2024 +0800

    Merge pull request #136 from Dousia/main

    Add detailcaps

commit b23d349
Author: ByteDance <bytedance@MacBook-Pro.local>
Date:   Sun Jul 7 23:24:19 2024 +0800

    Add install capture_metric in env

commit c6e211d
Author: ByteDance <bytedance@MacBook-Pro.local>
Date:   Sun Jul 7 23:04:13 2024 +0800

    Add detailcaps

commit 211bfed
Merge: 7c208b7 79514ee
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 2 23:05:12 2024 +0800

    Merge pull request #133 from EvolvingLMMs-Lab/dev/wild_vision

    Add wild vision bench

commit 79514ee
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 15:10:02 2024 +0000

    Fixing handling None filtered score

commit 725fac2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 08:25:42 2024 +0000

    Fixing dataset name

commit 8d963e1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 08:24:51 2024 +0000

    Fixing scoring logic

commit e2990d0
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 06:06:57 2024 +0000

    Hardcode to keep image for wild vision

commit ed38173
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 06:06:38 2024 +0000

    Add wild vision 0617

commit 7c208b7
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:53:31 2024 +0800

    Update README.md

commit 39d40de
Merge: e19b43a ba7081c
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:47:09 2024 +0800

    Merge pull request #129 from Dannoopsy/mmbench_ru

    add task MMBench-ru

commit e19b43a
Merge: 11fd7e3 a0de897
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:46:58 2024 +0800

    Merge pull request #128 from Dannoopsy/gqa-ru

    add task gqa-ru

commit 11fd7e3
Merge: 383e7fe a752259
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:46:16 2024 +0800

    Merge pull request #130 from lscpku/vitatecs

    Add task VITATECS

commit a752259
Author: lscpku <lisc99@pku.edu.cn>
Date:   Fri Jun 28 20:37:06 2024 +0800

    create new task vitatecs

commit ba7081c
Author: Dannoopsy <63581325+Dannoopsy@users.noreply.github.com>
Date:   Fri Jun 28 12:21:05 2024 +0300

    change prompt to ru

commit 27ea9c0
Author: Dannoopsy <belopolskikh.dd@phystech.edu>
Date:   Thu Jun 27 17:17:29 2024 +0000

    add mmbench_ru_dev

commit 383e7fe
Merge: 06fa000 ed2e7f7
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Jun 28 00:14:10 2024 +0800

    Merge pull request #126 from lorenzomammana/feature/external-package-integration

    External package integration using plugins

commit ed2e7f7
Merge: 03947e1 06fa000
Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
Date:   Thu Jun 27 15:38:10 2024 +0000

    Merge branch 'main' into feature/external-package-integration

commit a0de897
Author: Dannoopsy <belopolskikh.dd@phystech.edu>
Date:   Tue Jun 25 11:11:37 2024 +0000

    new task gqa-ru

commit 06fa000
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue Jun 25 06:41:13 2024 +0000

    Fix vid mme post prompt issue

commit b388d79
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 22:31:16 2024 +0800

    Update activitynetqa_generation.yaml

commit 8f9d620
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 14:02:25 2024 +0800

    Update pyproject.toml

commit 6341b7c
Merge: fce85f1 903b042
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 14:02:02 2024 +0800

    Merge pull request #125 from EvolvingLMMs-Lab/dev/interleave

    [Model] aligned llava-interleave model results on video tasks

commit 903b042
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Jun 22 12:07:13 2024 +0000

    Remove unnecessary lines for video llava

commit d78ec86
Merge: ebe7217 fce85f1
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 22 13:57:31 2024 +0800

    Merge branch 'main' into dev/interleave

commit ebe7217
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Jun 22 02:57:08 2024 +0000

    Delete unnecessary lines

commit 120c474
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:38:41 2024 +0000

    Revise model registry for llava_hf and longva

commit 7d6201f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:38:24 2024 +0000

    Add longva

commit 12f4806
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:35:39 2024 +0000

    Remove unnecessary lines since use batched visuals now in llava

commit 12cea76
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 18:15:32 2024 +0000

    chore: Add loguru for logging in lmms_eval package

commit 03947e1
Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
Date:   Wed Jun 5 13:40:41 2024 +0000

    feat: Allow including external tasks from plugins

commit b80a91f
Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
Date:   Wed Jun 5 13:04:55 2024 +0000

    feat: Allow loading model configurations from other packages

commit 8ef2474
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 12:11:03 2024 +0000

    chore: Remove unused models from lmms_eval package

commit af38885
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 12:07:09 2024 +0000

    chore: Handle ImportError when importing models

    Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.

commit fce85f1
Merge: dbe6329 d94f83c
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 20:02:12 2024 +0800

    Merge pull request #120 from EvolvingLMMs-Lab/pufanyi/hf_dataset_docs

    Add docs for datasets upload to HF

commit dbe6329
Author: choiszt <ls2001927@sohu.com>
Date:   Thu Jun 20 15:14:21 2024 +0800

    update ablation for videomme datasets

commit d94f83c
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 13:30:59 2024 +0800

    Update README.md

commit cab8159
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 13:30:29 2024 +0800

    Update README.md

commit 4587665
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu Jun 20 03:55:30 2024 +0000

    Add llava_hf back to registry

commit 3463651
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu Jun 20 03:54:33 2024 +0000

    Remove handling non-visual loop in llava

commit cb0d3f4
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jun 20 02:11:18 2024 +0800

    update readme

commit 813877b
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:37:52 2024 +0800

    to sh script

commit a14684b
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:37:04 2024 +0800

    lint

commit d0f8851
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:36:48 2024 +0800

    small fix

commit 63748e9
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:36:43 2024 +0800

    small fix

commit 7f1159a
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:35:05 2024 +0800

    update preparation

commit 19f9bd6
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:23:24 2024 +0800

    docs

commit ce6f889
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:04:16 2024 +0800

    tutorial

commit f513c52
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 19 06:51:19 2024 +0000

    chore: Update dependencies to fix potential risks and improve compatibility

commit efb5295
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Wed Jun 19 10:25:58 2024 +0800

    Release llava-wilder

commit 742651f
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 07:44:26 2024 +0800

    feat: Add support for auto downloading tar format videos

commit 511b625
Merge: 22a4958 050b2c3
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jun 18 17:01:03 2024 +0000

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

commit 050b2c3
Merge: 74facb4 ef30651
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 18 13:13:38 2024 +0800

    Merge pull request #114 from zjysteven/add-tinyllava

    add tinyllava

commit ef30651
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Mon Jun 17 17:57:02 2024 -0400

    fix typo

commit 9bab677
Merge: dbfb238 74facb4
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Sun Jun 16 10:56:05 2024 -0400

    Merge branch 'EvolvingLMMs-Lab:main' into add-tinyllava

commit 74facb4
Merge: 8ba192f d5df72d
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 16 17:59:19 2024 +0800

    Merge pull request #118 from teowu/main

    Fix the potential risk by PR #117

commit d5df72d
Merge: 5bf59ed 8ba192f
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sun Jun 16 15:32:13 2024 +0800

    Merge branch 'EvolvingLMMs-Lab:main' into main

commit 5bf59ed
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:27:28 2024 +0000

    fix #117, allow auto download with tar format videos

commit 98b3955
Merge: a056f11 be9dada
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:25:07 2024 +0000

    Merge branch 'main' of https://github.com/teowu/lmms-eval into main

commit a056f11
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:23:54 2024 +0000

    fix #117, allow auto download with tar format videos

commit 8ba192f
Merge: 7cc2890 be9dada
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 17:30:59 2024 +0800

    Merge pull request #117 from teowu/main

    LongVideoBench for LMMs-Eval

commit be9dada
Merge: 62ea8ce 7cc2890
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sat Jun 15 16:39:20 2024 +0800

    Merge pull request #1 from EvolvingLMMs-Lab/main

    Merge pull request #113 from teowu/main

commit 62ea8ce
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sat Jun 15 08:30:11 2024 +0000

    LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

commit 7cc2890
Merge: 4bc7224 ea14cd4
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 14:10:22 2024 +0800

    Merge pull request #113 from teowu/main

    Q-Bench, Q-Bench2, A-Bench

* feat: Add LlavaOneVision model to available models

chore: Update sqlitedict dependency to version 2.1.0

* Revert "Squashed commit of the following:"

This reverts commit 11b00999df3c43cb225482e030b791b2d454124c.

* Refactor available models in lmms_eval

Remove duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary in lmms_eval/models/__init__.py.

* fix: Handle import errors in lmms_eval models/__init__.py

The code changes in this commit fix the handling of import errors in the lmms_eval/models/__init__.py file. Previously, when an import error occurred, the code simply ignored it. This commit updates the code to log an error message using the logger module when an import error occurs.

This commit also removes duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary.

Recent user commits:
- Refactor available models in lmms_eval
- Revert "Squashed commit of the following:"
- feat: Add LlavaOneVision model to available models
- chore: Update sqlitedict dependency to version 2.1.0

* fix: Handle import errors in lmms_eval models/__init__.py

* chore: Remove unused imports in lmms_eval/models/__init__.py and lmms_eval/tasks/vcr_wiki/utils.py

* Remove unused imports in lmms_eval/tasks/vcr_wiki/utils.py

* chore: Update lmms_eval/tasks/vcr_wiki/utils.py

This commit updates the `lmms_eval/tasks/vcr_wiki/utils.py` file. It removes unused imports and fixes the condition for loading Spacy models based on the `load_package` value in the config file. Additionally, it adds a debug log message when the Spacy models are not loaded due to `load_package` being set to False.

Remove unused imports in `lmms_eval/tasks/vcr_wiki/utils.py`

* feat: Add new subtasks to overall score calculation

The code changes in this commit add new subtasks to the overall score calculation in the `overall_score` function. The subtasks "ScanQA", "BLINK", "MathVerse", "SciVerse", and "Mantis" are included in the `categories` dictionary. This ensures that the scores for these subtasks are calculated and included in the evaluation results.

Remove unused imports and update subtask categories in `utils.py`

* feat: Add new subtasks to overall score calculation

* chore: Update lmms_eval/tasks/llava_interleave_bench/_default_template_interleave_yaml

Update the image aspect ratio in the default template for the llava_interleave_bench task. Change the value of "image_aspect_ratio" from "original" to "pad". This ensures that the generated images have a padded aspect ratio.

* if no response directly return 0

* Squashed commit of the following:

commit b2a009b
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Mon Jul 15 19:12:25 2024 -0700

    if no response directly return 0 (#142)

commit 5fc5f2f
Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
Date:   Tue Jul 16 10:12:11 2024 +0800

    Add Muirbench (#143)

    * handle gen kwargs in internvl2

    * Add muirbench

* Add files via upload

(cherry picked from commit 557083a)

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: Yan Shu <570533048@qq.com>

Luodian added a commit that referenced this pull request Sep 1, 2024
* fix doc

* [WIP] adding mmbench dev evaluation (#75)

* WIP

* Update GPT evaluation model name and sys prompt

* 🛠️ Scale accuracy to percentage

The accuracy value is now multiplied by 100 in the aggregation function to represent it as a percentage. Regarding the evaluation process, `math` module importation and refactoring reduce progress log verbosity by logging every 100 evaluations instead of 10. It prevents potential logging overflow. Handling of NaN values is added to ensure 'default_value' is set in case of missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintenance.

Issue refs: #1427, #1533

* Update GPT evaluation model name and API configuration

* Refactor MMBench_Evaluator class to handle missing columns

* Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations

* Refactor MMBench-CN and MMBench-EN evaluation functions

* 🔄 Refactor result processing and logging logic

- Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
- Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
- In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
- Commented-out code and verbose logging during evaluation, which may have interfered with performance, has been removed for a more efficient and less intrusive logging experience.

This cleanup reduces redundancy in the codebase and improves evaluation performance.

Refs #2045

---------

Co-authored-by: Bo Li <bo.li01@bytedance.com>
(cherry picked from commit a19278c2ea6ddcbca64d3cc7f4efec7fe5775121)

* Create README.md

* Add files via upload

* Add MathVerse

* Fix typo in qwen_vl that was causing "reference before assignment"

* convert contexts to list if necessary and remove unnecessary construction of `questions`

* refactor query construction for clarity

* Create ScreenSpot on clean branch

* Update README to reflect new tasks

* Add README file specific to ScreenSpot

* slight update

* Init webSRC

* Draft README for WebSRC

* Update main README with new task names

* Draft and validate websrc eval on dev split

* Add code to enable compilation of submission for WebSRC test split

* Bugfix: WebSRC should be token-level F1 NOT character-level

* Add qwen vl api

* Fix llava conv template for llama3

* Fix llava_hf generation for 1.6

* Parse result for llava_hf 1.6

* Add model_name parameter to Llava constructor

* Fix endless warning for llava_hf generation

* Fix llava_hf image tokens number issue

* Create LICENSE

* Update LICENSE

* Update LICENSE

* Better task list_with_num

* Fix idefics2 llava in the wild bugs

* Remove redundant code in fuyu

* Fix instructblip qformer size mismatch and multi-images problem

* Comment out parse result in xcomposer

* Comment out Spice in caption task so that don't need to download stanford nlp model

* Update gitignore

* Add separated pope tasks by category

* Fix pope random name in pope full

* Set printing info for llava_hf to debug level

* Adding Phi3v model.

* Adding prompt arguments for Phi3v on MathVista-TestMini

* Adding documentation of Phi3v class.

* [Fix] import issues of multilingual llava and olympiadbench

* fix compatibility issue of older version llava

* add upd

* add upd

* add upd

* add upd

* add upd

* add upd

* Group MMMU images into one image (#83)

* update

* update font

* Add matplotlib.font_manager import in utils.py

* Refactor font handling in add_order_label function in utils.py

* group mmmu

---------

Co-authored-by: Li Bo <drluodian@gmail.com>

* merge model_specific_prompt_kwargs and dataset_name into each task yaml

* Add MathVerse in README.md

* slightly change query_prompt for the reproduction

* update utils.py for leaderboard submission

* add conbench

* update README

* Update README.md

* init include vcr

* modify the form of VCR

* switch logic

* add crossed_text to vcr_wiki output

* include the try-except logic for spacy

* update vcr_wiki tasks

* update vcr_wiki tasks in README.md

* include std and confidence interval

* update gpt-3.5-turbo version

* update gpt-3.5-turbo version

* chore: Remove unnecessary files and code related to live_bench and sft_eval tasks

* Bump version to 0.2.0.dev0

* chore: Update lmms-eval to support video evaluations for LLaVA models

* Update llava conv_template in lmms_eval/models/llava.py

* Update image alignment in README.md

* chore: Update lmms-eval to support video evaluations for LLaVA models

* chore: Update lmms-eval to support video evaluations for LLaVA models

* Update README.md

* Update README.md

* update aggregation function for vcr_wiki

* update README.md

* Update README.md

* update version

* add II-Bench

* fix dataset_path

* Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image

* add tinyllava

* LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

* fix #117, allow auto download with tar format videos

* fix #117, allow auto download with tar format videos

* fix typo

* feat: Add support for auto downloading tar format videos

* Release llava-wilder

* chore: Update dependencies to fix potential risks and improve compatibility

* tutorial

* docs

* update preparation

* small fix

* small fix

* lint

* to sh script

* update readme

* Remove handling non-visual loop in llava

* Add llava_hf back to registry

* Update README.md

* Update README.md

* update ablation for videomme datasets

* chore: Handle ImportError when importing models

Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.

* chore: Remove unused models from lmms_eval package

* feat: Allow loading model configurations from other packages

* feat: Allow including external tasks from plugins

* chore: Add loguru for logging in lmms_eval package

* Remove unnecessary lines since use batched visuals now in llava

* Add longva

* Revise model registry for llava_hf and longva

* Delete unnecessary lines

* Remove unnecessary lines for video llava

* Update pyproject.toml

* Update activitynetqa_generation.yaml

* Fix vid mme post prompt issue

* new task gqa-ru

* add mmbench_ru_dev

* change prompt to ru

* create new task vitatecs

* Update README.md

* Add wild vision 0617

* Hardcode to keep image for wild vision

* Fixing scoring logic

* Fixing dataset name

* Fixing handling None filtered score

* Add detailcaps

* Add install capture_metric in env

* Add files via upload

* feat: Add tie_weights parameter to Llava model initialization

* Upgrade lmms-eval to support more models and evaluation tasks

* Upgrade lmms-eval to version 0.2.1

* Rename xcomposer 4KHD

* chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

* Update utils.py

* Update _default_template_vcr_yaml

* add process sync via temp file in lmms_eval/evaluator.py

* Update utils.py

* Update _default_template_vcr_yaml

* Add muirbench

* Squashed commit of the following:

commit dfdba507b5fbe985b0030ffec575f9f2638bc1ed
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 16 11:13:52 2024 +0800

    merge ov evals (#144)

    * chore: Update gpt_eval_model_name to "gpt-3.5-turbo" in mathvista.yaml

    * Squashed commit of the following:

    commit 994c9f97a2f8db3e9b7d7933d1e1680acde5b70b
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 17:21:23 2024 +0800

        Add files via upload

    * Squashed commit of the following:

    commit e31cd7883d4555c7530795c7f102b8d78cbd372f
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jul 10 12:08:08 2024 +1000

        chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

    commit 1d8c980d1089f9d7702c3b92d5c85039f2809c6d
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jul 9 02:08:52 2024 +0000

        Rename xcomposer 4KHD

    commit 6da76f36ecf5f9aa73057e767a4fcb60c99ff896
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:55:56 2024 +1000

        Upgrade lmms-eval to version 0.2.1

    commit cd1858523fcd8630082cbefba8710e0de3ee8805
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:52:23 2024 +1000

        Upgrade lmms-eval to support more models and evaluation tasks

    commit 672d7e5bb49dcb34e1b2fdeb09f3f4588dc583a6
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:43:41 2024 +1000

        feat: Add tie_weights parameter to Llava model initialization

    commit 2037a86261b55fa42b8ba3a04eab192b3e69d6ea
    Merge: e6844db1 a5c18692
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:37:12 2024 +1000

        Fix gen kwargs image aspect ratio in internvl2

    commit a5c186925de989b616f58a35ece36065a32b4594
    Merge: 2ebec77f 557083a1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 9 09:15:56 2024 +0800

        Merge pull request #137 from shuyansy/main

        add MLVU task

    commit 557083a156c3dd67ac79e22b4202e9b69b6b00f4
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 16:56:50 2024 +0800

        Add files via upload

    commit 2ebec77f5606d79e9a7b995970e32792050606a1
    Merge: 211bfede b23d349e
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 8 11:53:06 2024 +0800

        Merge pull request #136 from Dousia/main

        Add detailcaps

    commit b23d349e46d60dc149ffaa54d6e019f4996ed92d
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:24:19 2024 +0800

        Add install capture_metric in env

    commit c6e211d5f9dbb7572d3a141b6504cb1ca2007c33
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:04:13 2024 +0800

        Add detailcaps

    commit 211bfedebad243ef82a8b0be36c3b5a9b9cb2f72
    Merge: 7c208b76 79514eee
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 2 23:05:12 2024 +0800

        Merge pull request #133 from EvolvingLMMs-Lab/dev/wild_vision

        Add wild vision bench

    commit 79514eeebcfd6f655be2a10c776037d12a7b7214
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 15:10:02 2024 +0000

        Fixing handling None filtered score

    commit 725fac2781446958b905e1e6c6eb3c0a8e582e49
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:25:42 2024 +0000

        Fixing dataset name

    commit 8d963e132ac03fc0d835d480cfcfcabe72af143c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:24:51 2024 +0000

        Fixing scoring logic

    commit e2990d0a69e876721256fdf946c68ba7ae0cbdc1
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:57 2024 +0000

        Hardcode to keep image for wild vision

    commit ed381736730d8fb785b4ee919fdb751734ecef25
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:38 2024 +0000

        Add wild vision 0617

    commit 7c208b76640c986cfe94233dce735c3ca4ad4319
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:53:31 2024 +0800

        Update README.md

    commit 39d40dea47bc59ff04e8b0cbc445345098debc9a
    Merge: e19b43a3 ba7081c0
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:47:09 2024 +0800

        Merge pull request #129 from Dannoopsy/mmbench_ru

        add task MMBench-ru

    commit e19b43a3a1e7212e623061b164b0419cc0dda689
    Merge: 11fd7e3f a0de8970
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:58 2024 +0800

        Merge pull request #128 from Dannoopsy/gqa-ru

        add task gqa-ru

    commit 11fd7e3fc05908aeb01e4a6161a7b55cd38b3122
    Merge: 383e7fea a7522592
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:16 2024 +0800

        Merge pull request #130 from lscpku/vitatecs

        Add task VITATECS

    commit a75225926e5954f85466d257f99acf0163fde596
    Author: lscpku <lisc99@pku.edu.cn>
    Date:   Fri Jun 28 20:37:06 2024 +0800

        create new task vitatecs

    commit ba7081c0abac840002d320e30733e891298dfa11
    Author: Dannoopsy <63581325+Dannoopsy@users.noreply.github.com>
    Date:   Fri Jun 28 12:21:05 2024 +0300

        change prompt to ru

    commit 27ea9c0055a8abf3a8198829b8617018479918e2
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Thu Jun 27 17:17:29 2024 +0000

        add mmbench_ru_dev

    commit 383e7fead3138aedf62e9c0ec48303835ef26e2a
    Merge: 06fa000f ed2e7f79
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri Jun 28 00:14:10 2024 +0800

        Merge pull request #126 from lorenzomammana/feature/external-package-integration

        External package integration using plugins

    commit ed2e7f792151d21bce8f1c498270b9391e1d5c85
    Merge: 03947e14 06fa000f
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Thu Jun 27 15:38:10 2024 +0000

        Merge branch 'main' into feature/external-package-integration

    commit a0de89708d5e6f259bb17f0eaace3c5b901b275c
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Tue Jun 25 11:11:37 2024 +0000

        new task gqa-ru

    commit 06fa000f60d3e4d160fac8ceb9959ae92a98f752
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jun 25 06:41:13 2024 +0000

        Fix vid mme post prompt issue

    commit b388d79e0df6f60068196cb7047453ebd22d6ef1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 22:31:16 2024 +0800

        Update activitynetqa_generation.yaml

    commit 8f9d620fcd9d0a0742ee6bcf51ea63bd6b088a36
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:25 2024 +0800

        Update pyproject.toml

    commit 6341b7c15ce9fb28eb06b067ddb299d6cf2e16c3
    Merge: fce85f1b 903b042b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:02 2024 +0800

        Merge pull request #125 from EvolvingLMMs-Lab/dev/interleave

        [Model] aligned llava-interleave model results on video tasks

    commit 903b042be016016d4ebeecb07701f3076a2d323c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 12:07:13 2024 +0000

        Remove unnecessary lines for video llava

    commit d78ec86407b729a964906a8c2e50704b4bc74d06
    Merge: ebe7217a fce85f1b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Jun 22 13:57:31 2024 +0800

        Merge branch 'main' into dev/interleave

    commit ebe7217a486c1e754e42c2cbdb834e09fbbcc9b0
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 02:57:08 2024 +0000

        Delete unnecessary lines

    commit 120c474b056f9177c74e1fd9691d59e2f234b785
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:38:41 2024 +0000

        Revise model registry for llava_hf and longva

    commit 7d6201f921088afd3f52a35076e3c6fcc9aa518c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:38:24 2024 +0000

        Add longva

    commit 12f480699c71a12a24d4349d9b0681933201a3a6
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:35:39 2024 +0000

        Remove unnecessary lines since we now use batched visuals in llava

    commit 12cea76f1f0f14b1fd1007c9d39a9b0557368637
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 18:15:32 2024 +0000

        chore: Add loguru for logging in lmms_eval package

    commit 03947e14a46fd25b412931f7c9c25f4a2971d0b4
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Wed Jun 5 13:40:41 2024 +0000

        feat: Allow including external tasks from plugins

    commit b80a91f73e15ddd0b0ce1322d7d121fa14030eed
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Wed Jun 5 13:04:55 2024 +0000

        feat: Allow loading model configurations from other packages

    commit 8ef24740dd48a11c97eb627f2fff4aca107fef0d
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 12:11:03 2024 +0000

        chore: Remove unused models from lmms_eval package

    commit af38885fc2e066f5ea44388f33e07176f836fe28
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 12:07:09 2024 +0000

        chore: Handle ImportError when importing models

        Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.

    commit fce85f1b03ff7043b29dee787c5d17a08dd2687a
    Merge: dbe63293 d94f83cb
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 20:02:12 2024 +0800

        Merge pull request #120 from EvolvingLMMs-Lab/pufanyi/hf_dataset_docs

        Add docs for datasets upload to HF

    commit dbe63293245a5141fdfd80bda7657c304f6bd32f
    Author: choiszt <ls2001927@sohu.com>
    Date:   Thu Jun 20 15:14:21 2024 +0800

        update ablation for videomme datasets

    commit d94f83cb3f08b61a2c75cc4326e58792100605b3
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 13:30:59 2024 +0800

        Update README.md

    commit cab8159ff35db330536c0b6dfb4b0a3b24142209
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 13:30:29 2024 +0800

        Update README.md

    commit 45876652a877a8006b828f32f5cc4660629f9190
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu Jun 20 03:55:30 2024 +0000

        Add llava_hf back to registry

    commit 3463651b8c54d36cd94169e3d376f5ed225a195a
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu Jun 20 03:54:33 2024 +0000

        Remove handling non-visual loop in llava

    commit cb0d3f49b72790b081f981e0e6147131542f7f68
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Thu Jun 20 02:11:18 2024 +0800

        update readme

    commit 813877bfe5ac590cdbe92dd74d18f83a2091f748
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:37:52 2024 +0800

        to sh script

    commit a14684b8557d5894976448a5c559ed7a66a6cf16
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:37:04 2024 +0800

        lint

    commit d0f8851d42ba31f5da2a7a65e91499db45174dbc
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:36:48 2024 +0800

        small fix

    commit 63748e9718f287ad433afc90e340b5e17a89c1ed
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:36:43 2024 +0800

        small fix

    commit 7f1159a1fe04cfb783dc31d4fbdef3bda0ce19e4
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:35:05 2024 +0800

        update preparation

    commit 19f9bd621c76a483ff98f8c7eb78f64753da683a
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:23:24 2024 +0800

        docs

    commit ce6f889ba02d819979c7922f6336cf4f1f718f65
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:04:16 2024 +0800

        tutorial

    commit f513c520c2a3dad26d2b2ca5c4ed4db05a493c73
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 19 06:51:19 2024 +0000

        chore: Update dependencies to fix potential risks and improve compatibility

    commit efb529552c5e4ba039a4cba8e9aa5cb7ba65bf90
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Wed Jun 19 10:25:58 2024 +0800

        Release llava-wilder

    commit 742651fc9daf97e2f57831ed6e6e7ee7ead7d555
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 07:44:26 2024 +0800

        feat: Add support for auto downloading tar format videos

    commit 511b6259828212fcba954cdeb8cf90d6e5daabf8
    Merge: 22a4958e 050b2c37
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jun 18 17:01:03 2024 +0000

        Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

    commit 050b2c370017e9b97475dd6cf01fd051b5ca5c86
    Merge: 74facb41 ef306512
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jun 18 13:13:38 2024 +0800

        Merge pull request #114 from zjysteven/add-tinyllava

        add tinyllava

    commit ef306512e5135f76dffa383f600b8733015836e8
    Author: Jingyang Zhang <jingyang.zhang@duke.edu>
    Date:   Mon Jun 17 17:57:02 2024 -0400

        fix typo

    commit 9bab67732a4238097725deddf867fb1946ffee40
    Merge: dbfb2387 74facb41
    Author: Jingyang Zhang <jingyang.zhang@duke.edu>
    Date:   Sun Jun 16 10:56:05 2024 -0400

        Merge branch 'EvolvingLMMs-Lab:main' into add-tinyllava

    commit 74facb41a826691dfce4458cf1d8659b34fc5bf5
    Merge: 8ba192f9 d5df72de
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 16 17:59:19 2024 +0800

        Merge pull request #118 from teowu/main

        Fix the potential risk by PR #117

    commit d5df72de2d03108d6b365818ecc3551ac9aa6302
    Merge: 5bf59ed2 8ba192f9
    Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
    Date:   Sun Jun 16 15:32:13 2024 +0800

        Merge branch 'EvolvingLMMs-Lab:main' into main

    commit 5bf59ed250da98a408a94e214a73caa400cba842
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Sun Jun 16 07:27:28 2024 +0000

        fix #117, allow auto download with tar format videos

    commit 98b3955cb808e36303c030aea78eb037d1ec59ce
    Merge: a056f118 be9dada8
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Sun Jun 16 07:25:07 2024 +0000

        Merge branch 'main' of https://github.com/teowu/lmms-eval into main

    commit a056f118704eccec86ce32ab86981ce4bc1e1deb
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Sun Jun 16 07:23:54 2024 +0000

        fix #117, allow auto download with tar format videos

    commit 8ba192f94edf5d99598983445d5faa4f8807c49f
    Merge: 7cc28907 be9dada8
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Jun 15 17:30:59 2024 +0800

        Merge pull request #117 from teowu/main

        LongVideoBench for LMMs-Eval

    commit be9dada8b4189c53c08e1674ab273242cf2f80a0
    Merge: 62ea8ceb 7cc28907
    Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
    Date:   Sat Jun 15 16:39:20 2024 +0800

        Merge pull request #1 from EvolvingLMMs-Lab/main

        Merge pull request #113 from teowu/main

    commit 62ea8ceb223ef2b51ebab2bcd50d5cf339c35cfe
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Sat Jun 15 08:30:11 2024 +0000

        LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

    commit 7cc28907edbb4eb58ee1398772a48110ea35dd96
    Merge: 4bc7224d ea14cd4b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Jun 15 14:10:22 2024 +0800

        Merge pull request #113 from teowu/main

        Q-Bench, Q-Bench2, A-Bench

    commit dbfb23873979f789477f4797ee2d6071e0fd921e
    Author: Jingyang <jingyang.zhang@duke.edu>
    Date:   Fri Jun 14 16:20:42 2024 -0400

        add tinyllava

    commit ea14cd4b361f4c95b3665cbdb95bc51754090eb5
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Fri Jun 14 15:01:52 2024 +0000

        Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image

    commit 4bc7224dcd27fe8b288bfc3fed4d7a9da9635658
    Merge: 2797987f bf14cb85
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri Jun 14 02:14:43 2024 +0800

        Merge pull request #111 from XinrunDu/main

        add II-Bench

    commit bf14cb8527b2b7ac438a36567a875168bc02d294
    Author: XinrunDu <duxinrun2000@gmail.com>
    Date:   Thu Jun 13 09:37:02 2024 +0000

        fix dataset_path

    commit 6248113f4e11a0ac396d31fa1b032a142fea8cb4
    Author: XinrunDu <duxinrun2000@gmail.com>
    Date:   Thu Jun 13 09:32:06 2024 +0000

        add II-Bench

    commit 2797987f5b88b87bd172714b678a75a1d8051826
    Merge: 63d82f1f 66d4bb2d
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 11:14:47 2024 +0800

        Merge pull request #109 from EvolvingLMMs-Lab/pufanyi/update_version

        [Small Update] Update the version of LMMs-Eval

    commit 66d4bb2d9c9afbbdea40196d4ad80e214d0b14b6
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Thu Jun 13 11:13:00 2024 +0800

        update version

    commit 63d82f1ff11eb430d91a15d6788a1f0b4d596850
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 11:04:32 2024 +0800

        Update README.md

    commit 44a33799671cb668f55366d5e5a4ddb051a3a1b4
    Merge: 5ed00356 0ce46d08
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 04:00:12 2024 +0800

        Merge pull request #105 from tianyu-z/main

        Include VCR

    commit 0ce46d088e473d12d63de44f17c67dceab25658c
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Wed Jun 12 15:56:34 2024 -0400

        update README.md

    commit 46a88d8b0199ed44d2ff459fb372f2e006960cea
    Merge: 47b13b9b 5ed00356
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Wed Jun 12 15:50:26 2024 -0400

        merged readme.md

    commit 47b13b9b320d36ac53b3622557e31239f7c22621
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Wed Jun 12 15:30:52 2024 -0400

        update aggregation function for vcr_wiki

    commit 5ed00356676cf5d0ff056cf27d1b519b8e303ff7
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 03:21:42 2024 +0800

        Update README.md

    commit ed8806839db5988ced672bd162b7b046edb4863a
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 03:13:59 2024 +0800

        Update README.md

    commit fea3806026932a6e2bd6e538bcc413e33abdf245
    Merge: d99a24ab 05dc8e85
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 03:11:49 2024 +0800

        Merge pull request #108 from EvolvingLMMs-Lab/internal_main_dev

        [Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval

    commit 05dc8e853eab7c6bc782a1e2662d2efe7422f767
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:56:04 2024 +0000

        chore: Update lmms-eval to support video evaluations for LLaVA models

    commit cbeee20bc4ffb510a2b23d96cdaf4077be7c2a9e
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:50:30 2024 +0000

        chore: Update lmms-eval to support video evaluations for LLaVA models

    commit f00d5498b69dd4f7e54c907ac906abc7c128f000
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:46:33 2024 +0000

        Update image alignment in README.md

    commit 34156335db74cef9e3f0915d7172fd6b22456c15
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:43:16 2024 +0000

        Update llava conv_template in lmms_eval/models/llava.py

    commit 50575a950736bc8fc1e191310314cbb5fdff5720
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:39:03 2024 +0000

        chore: Update lmms-eval to support video evaluations for LLaVA models

    commit c9b2252fb8a15dd04252af5e6b4613855afd6ada
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:33:48 2024 +0000

        Bump version to 0.2.0.dev0

    commit 465bd4205e8097e9c037b24a3ed08dd6a7694efa
    Merge: e43bd840 d99a24ab
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:04:25 2024 +0000

        Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval into internal_main_dev

    commit e43bd840b63eb499856e36d9d2ba45c924abcead
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 14:54:06 2024 +0000

        chore: Remove unnecessary files and code related to live_bench and sft_eval tasks

    commit d99a24abd06df10d07e5a4d0ad5030613f92f2e7
    Merge: 374590be a66003be
    Author: Li Bo <drluodian@gmail.com>
    Date:   Wed Jun 12 19:45:57 2024 +0800

        Merge pull request #107 from AtsuMiyai/new_task/upd_update

        update gpt-3.5-turbo version

    commit a66003befe4175824a1be6ed59f5f5b88c15f792
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed Jun 12 17:05:17 2024 +0900

        update gpt-3.5-turbo version

    commit ee91f272985f32eeb9cd6faa41afdd8eb49cac30
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed Jun 12 16:50:53 2024 +0900

        update gpt-3.5-turbo version

    commit 326b9694fc77398592b8caf3ba0bc2e2bb903813
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 20:07:40 2024 -0400

        include std and confidence interval

    commit cd050d4a721d01a2ace0cd030cf7f8dc67eb8c4d
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Mon Jun 10 18:49:47 2024 -0400

        update vcr_wiki tasks in README.md

    commit 205721e0aad76dde30255e56149bbed121883356
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Mon Jun 10 18:43:15 2024 -0400

        update vcr_wiki tasks

    commit db8e718b502469e8536ee359c5559de87635ffc7
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 16:13:58 2024 -0400

        include the try-except logic for spacy

    commit 427dabb790118f538b64e4e5bf6a7aab9689b3d9
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Mon Jun 10 15:51:05 2024 -0400

        add crossed_text to vcr_wiki output

    commit 043b483eb55f7be4fea75c9bc0b9b03d251b109b
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 15:47:00 2024 -0400

        switch logic

    commit e1f04db8f58dd10591fde335ea13f74cda7c79bd
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 02:38:21 2024 -0400

        modify the form of VCR

    commit 96e8d9867c9549ab7490f4b12cfeb6a06238e0aa
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 00:10:30 2024 -0400

        init include vcr

    commit 374590be62f988a76cf6704cfe394cd8ae7d4cb6
    Merge: 504685e2 cb3b9ce7
    Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
    Date:   Fri Jun 7 20:25:48 2024 +0800

        Merge pull request #101 from Gumpest/main

        Update conbench in README

    commit 504685e20b17659b913cf46f3012c16bf429e09d
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 6 15:42:15 2024 +0800

        Update README.md

    commit cb3b9ce71411da862ff01342a9122a3c656ffbd1
    Merge: c9793b38 67b64ea4
    Author: Yuan Zhang <56063339+Gumpest@users.noreply.github.com>
    Date:   Thu Jun 6 11:22:24 2024 +0800

        Merge branch 'EvolvingLMMs-Lab:main' into main

    commit c9793b3883714f254a700230b7bee781d6110e73
    Author: Yuan Zhang <gump_well_done@163.com>
    Date:   Thu Jun 6 11:21:05 2024 +0800

        update README

    commit 67b64ea44a5a39d96c7a196a8a8345a7486bd912
    Merge: 8ee7848a 5fd68451
    Author: Li Bo <drluodian@gmail.com>
    Date:   Wed Jun 5 23:12:58 2024 +0800

        Merge pull request #100 from Gumpest/main

        add Conbench

    commit 5fd684515c55ef643726c1b6c720c7cbd2183ba1
    Author: Yuan Zhang <gump_well_done@163.com>
    Date:   Wed Jun 5 21:52:31 2024 +0800

        add conbench

    commit 8ee7848aaa6383aa1f919c3f21199c81db3fff89
    Merge: 747e1978 6fefaf7c
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jun 4 17:09:33 2024 +0800

        Merge pull request #95 from AtsuMiyai/new_task/upd

        add MM-UPD

    commit 747e19782996065cdce7157ee8c5e15beb5b6c59
    Merge: 4854a34d 05843072
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jun 4 17:09:04 2024 +0800

        Merge pull request #97 from CaraJ7/update

        Add MathVerse in README.md

    commit 6fefaf7cea504e35583ee7217449da290295a7a4
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Tue Jun 4 17:36:39 2024 +0900

        update utils.py for leaderboard submission

    commit 5f4fe360def1c48ea0cb1da6409d192784882308
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Sun Jun 2 23:28:27 2024 +0900

        slightly change query_prompt for the reproduction

    commit 05843072d608b970bcada1cd0db65a3c80864060
    Author: CaraJ7 <1350074492@qq.com>
    Date:   Sun Jun 2 17:05:28 2024 +0800

        Add MathVerse in README.md

    commit 0581ab3cfb362e2024988b46fbbb00324f1233c9
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Fri May 31 16:09:45 2024 +0900

        merge model_specific_prompt_kwargs and dataset_name into each task yaml

    commit 4854a34d4d37efb5e201f2691ecdb054590cf20b
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Sat May 4 19:23:39 2024 +0800

        Group MMMU images into one image (#83)

        * update

        * update font

        * Add matplotlib.font_manager import in utils.py

        * Refactor font handling in add_order_label function in utils.py

        * group mmmu

        ---------

        Co-authored-by: Li Bo <drluodian@gmail.com>

    commit d224794c49520f4d28a31862cf977198cd6cbc5e
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 15:15:59 2024 +0900

        add upd

    commit 453e7936424220f02b99517059ca71babfbe5f5a
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 15:03:30 2024 +0900

        add upd

    commit 909edd6769ddcf8a546be4fdd129416687516878
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 12:52:21 2024 +0900

        add upd

    commit 7c1ac9706cafc4801fa4da181d2f610b7838c7b8
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 12:50:32 2024 +0900

        add upd

    commit 811301c5280ddd74986645086f026ab730c8848c
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 12:46:58 2024 +0900

        add upd

    commit 71401bafd1d515f704f86ab4817a758542bc4672
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 12:41:21 2024 +0900

        add upd

    commit 24dc435908d921e9f1a5706e3141b12e5d838d18
    Author: Bo Li <drluodian@gmail.com>
    Date:   Mon May 27 10:17:32 2024 +0000

        fix compatibility issue of older version llava

    commit 616edf43731415b35f0f5e97748ed2e017a2891d
    Author: Bo Li <drluodian@gmail.com>
    Date:   Mon May 27 09:32:26 2024 +0000

        [Fix] import issues of multilingual llava and olympiadbench

    commit 4c5a99e21a63fb0ee1c7d15546d18066e1d9894b
    Merge: 45c05b2b b05c3e22
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon May 27 14:19:53 2024 +0800

        Merge pull request #87 from vfragoso/vifragos/phi3v

        Adding microsoft/Phi-3-vision-128k-instruct model.

    commit b05c3e222fabd308dd7af4e04c1c6a0812962fe6
    Author: Victor Fragoso <victor.fragoso@microsoft.com>
    Date:   Fri May 24 16:36:37 2024 +0000

        Adding documentation of Phi3v class.

    commit c2008971308ce8168d57c24d00b725832f099244
    Author: Victor Fragoso <victor.fragoso@microsoft.com>
    Date:   Fri May 24 16:25:02 2024 +0000

        Adding prompt arguments for Phi3v on MathVista-TestMini

    commit 7f9fb6bcc6cd24a7b8011b8753d0ea98cc2451fd
    Author: Victor Fragoso <victor.fragoso@microsoft.com>
    Date:   Fri May 24 13:24:16 2024 +0000

        Adding Phi3v model.

    commit 45c05b2b2bece76e06849a52a0d034f9c0ac2367
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 23 03:47:36 2024 +0000

        Set printing info for llava_hf to debug level

    commit 53f013ed8278776551ca992562253387cc9968d2
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 23 03:41:39 2024 +0000

        Fix pope random name in pope full

    commit 22520a95f13334b75eee0cf0387151067a6bf516
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 23 03:41:14 2024 +0000

        Add separated pope tasks by category

    commit d1eefb1565014b47287ffa6b350229062f8f602f
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 9 08:36:02 2024 +0000

        Update gitignore

    commit b2b4dbd2dc13432c79208db35abf7f55c97f1790
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon May 20 07:45:11 2024 +0000

        Comment out Spice in caption task so that we don't need to download the Stanford NLP model

    commit 662f05ce4c62a46a83f819d3a5925a9bd20059b5
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon May 20 03:13:13 2024 +0000

        Comment out parse result in xcomposer

    commit 09329322916bfbb604d72ddaf50441a0947f8805
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 16 03:55:39 2024 +0000

        Fix instructblip qformer size mismatch and multi-images problem

    commit 557a6a3b15e07e506bc05e2cc76ff6a2f8c93964
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 16 03:11:41 2024 +0000

        Remove redundant code in fuyu

    commit 6aeb5504e74ed1980b53700d8e4d4dcf7d1b38fc
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 16 01:45:24 2024 +0000

        Fix idefics2 llava in the wild bugs

    commit aea80e6a71f716951353e1e5d68380243396b4d6
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Wed May 15 11:07:35 2024 +0000

        Better task list_with_num

    commit 3c12a080d66b9c38f615b961befca7c30f82fa39
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat May 18 02:35:52 2024 +0800

        Update LICENSE

    commit 82317a635a4978b32e095a06cc295d0ae23661c2
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat May 18 02:29:09 2024 +0800

        Update LICENSE

    commit a8bba1cdb51061a0d27bf9a98cca1505b5c58ea5
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat May 18 02:28:03 2024 +0800

        Create LICENSE

    commit caa5893b5fd2c1d32c72b97f371ccd9a8d9ec3a0
    Merge: c0944486 423b0060
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon May 13 11:45:26 2024 +0800

        Merge pull request #73 from EvolvingLMMs-Lab/kc/qwen_vl_api

        [Feat] Add qwen vl api

    commit c09444860362a136f17641f8b2a1f91c2bbc3715
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat May 11 06:11:19 2024 +0000

        Fix llava_hf image tokens number issue

    commit 64f07e497f53e5bcbe9e8fb5830cc7a1daaf7ff1
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 9 02:04:10 2024 +0000

        Fix endless warning for llava_hf generation

    commit 8aaa828108da8514dd9cd23a9d6d83a8b67f2d65
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu May 2 06:13:56 2024 +0000

        Add model_name parameter to Llava constructor

    commit 7847dc4d8efe60605102414bb071b1da9851228e
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue May 7 03:15:59 2024 +0000

        Parse result for llava_hf 1.6

    commit 3e56b4f92db39a2ce92903b0c43a34f1d14d59ec
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue May 7 03:09:56 2024 +0000

        Fix llava_hf generation for 1.6

    commit fa3ff92b07ea5aaa633a2039818c310744f84d07
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon May 6 08:32:57 2024 +0000

        Fix llava conv template for llama3

    commit 423b00606aa77fd6b324c19e3d480b73ab852db6
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sun May 5 07:54:52 2024 +0000

        Add qwen vl api

    commit b7fd7a9f7aa3c0e1e50374047dfffc46a7462b90
    Merge: 986139a9 c5a130b6
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun May 5 13:19:48 2024 +0800

        Merge pull request #59 from EvolvingLMMs-Lab/add_idefics2

        add idefics2

    commit 986139a9a31154679bdea029b09639f84712db27
    Merge: b46239ca 8d3526c0
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri May 3 01:18:18 2024 +0800

        Merge pull request #36 from cocoshe/main

        [Fix] repr llava doc

    commit b46239cabab7b545ec99d9eae6c851e531b18374
    Merge: bc69a744 373265f2
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri May 3 01:17:34 2024 +0800

        Merge pull request #56 from gagan3012/main

        Multilingual LLava bench

    commit bc69a744d2cffeb06eba62e843bcc7869e27613a
    Merge: eef3aeb6 626e8a91
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri May 3 01:12:14 2024 +0800

        Merge pull request #70 from hunterheiden/hsh/new_task/WebSRC

        Bugfix: WebSRC should be token-level F1 NOT character-level

    commit 626e8a91a4af2dd5dd774fc130cc2f4d74b2bc37
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Thu May 2 09:31:03 2024 -0400

        Bugfix: WebSRC should be token-level F1 NOT character-level

    commit eef3aeb6ab589bb1d5045af5b5c1984a69402d19
    Merge: c4e9dd9f 9bca4413
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu May 2 14:38:17 2024 +0800

        Merge pull request #69 from hunterheiden/hsh/new_task/WebSRC

        [New Task] WebSRC (multimodal Q&A on web screenshots)

    commit 9bca441376325173128e5c50087f068e519c48da
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed May 1 11:07:29 2024 -0400

        Add code to enable compilation of submission for WebSRC test split

    commit 7687495b1ed552eeba088cb9ad5aaf1170e7fff9
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed May 1 10:47:32 2024 -0400

        Draft and validate websrc eval on dev split

    commit 4eebd3e5d7ab3b8c3116eea57318db72d2ce32bb
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed May 1 10:46:54 2024 -0400

        Update main README with new task names

    commit 35fe80b67656114a8824eb59574089663bdc4c9a
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed May 1 10:46:20 2024 -0400

        Draft README for WebSRC

    commit 955bd0635cc6c14a96ad869f1002e6dbefdc5071
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Tue Apr 30 10:16:21 2024 -0400

        Init webSRC

    commit c4e9dd9f6e40e8586587c4a75987aa109a37f14b
    Merge: d8a3a99f 319afccb
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri Apr 26 14:37:22 2024 +0800

        Merge pull request #63 from hunterheiden/hsh/new_task/screenspot

        New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens

    commit 319afccbe713ddf40a8a6fa28501e64c0ad34725
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Thu Apr 25 11:44:34 2024 -0400

        slight update

    commit 2f3811ca1bbad6a441016b05fde09a571900fca8
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Thu Apr 25 11:41:04 2024 -0400

        Add README file specific to ScreenSpot

    commit 28962cbe83631ec5d6481aaea4907a7c96fec848
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed Apr 24 11:52:33 2024 -0400

        Update README to reflect new tasks

    commit e457cfb4f2d6869e8367d6d5b03ad25ee4acc363
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Tue Apr 23 18:33:16 2024 -0400

        Create ScreenSpot on clean branch

    commit d8a3a99ff6142fe101fa3c188cc7f29593c44345
    Merge: 3dcd0158 ed171293
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Apr 23 10:34:03 2024 +0800

        Merge pull request #61 from tupini07/patch-1

        Fix typo in Qwen-VL that was causing "reference before assignment"

    commit ed171293d1e82075c5c6a847fc91ecbfd45cf89f
    Author: Andrea Tupini <tupini07@gmail.com>
    Date:   Mon Apr 22 14:56:41 2024 -0600

        refactor query construction for clarity

    commit cd874201c46f32a2903ddffae85f9db73e14adfd
    Author: Andrea Tupini <tupini07@gmail.com>
    Date:   Mon Apr 22 14:54:29 2024 -0600

        convert contexts to list if necessary and remove unnecessary construction of `questions`

    commit 85573674e90c8d505312ba18c5102e0051255078
    Author: Andrea Tupini <tupini07@gmail.com>
    Date:   Mon Apr 22 14:47:33 2024 -0600

        Fix typo in qwen_vl that was causing "reference before assignment"

    commit 3dcd01582b719555bcf8eb25d91cc5e42abd2c5f
    Merge: 95df9fee 743673a1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Apr 20 22:03:16 2024 +0800

        Merge pull request #60 from CaraJ7/main

        Add MathVerse

    commit 743673a1419b6e729e18c96f148745cc739d4c71
    Merge: c1a54721 95df9fee
    Author: CaraJ7 <1350074492@qq.com>
    Date:   Sat Apr 20 21:49:02 2024 +0800

        Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

    commit c1a5472135c3b84061b64d997ab50dda0412ba4f
    Author: CaraJ7 <1350074492@qq.com>
    Date:   Sat Apr 20 21:45:34 2024 +0800

        Add MathVerse

    commit 373265f24e7a89cbd49ab724a2e388cc0930be78
    Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
    Date:   Fri Apr 12 17:21:39 2024 -0700

        Add files via upload

    commit d8530514a5ef9378d2adeaceb228b60ec25a6718
    Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
    Date:   Fri Apr 12 17:19:49 2024 -0700

        Create README.md

    commit 22a4958e993463edff352ac033014f9a485706cc
    Author: Bo Li <bo.li01@bytedance.com>
    Date:   Thu Apr 4 17:12:43 2024 +0000

        [WIP] adding mmbench dev evaluation (#75)

        * WIP

        * Update GPT evaluation model name and sys prompt

        * 🛠️ Scale accuracy to percentage

        The accuracy value is now multiplied by 100 in the aggregation function so it is reported as a percentage. In the evaluation process, importing the `math` module and refactoring reduce progress-log verbosity by logging every 100 evaluations instead of every 10, preventing potential logging overflow. Handling of NaN values is added to ensure 'default_value' is set in case of missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintainability.

        Issue refs: #1427, #1533

        * Update GPT evaluation model name and API configuration

        * Refactor MMBench_Evaluator class to handle missing columns

        * Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations

        * Refactor MMBench-CN and MMBench-EN evaluation functions

        * 🔄 Refactor result processing and logging logic

        - Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
        - Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
        - In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
        - Commented-out code and verbose logging during evaluation, which may have interfered with performance, have been removed for a more efficient and less intrusive logging experience.

        This cleanup reduces redundancy in the codebase and improves evaluation performance.

        Refs #2045

        ---------

        Co-authored-by: Bo Li <bo.li01@bytedance.com>
        (cherry picked from commit a19278c2ea6ddcbca64d3cc7f4efec7fe5775121)
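
A rough sketch of the option handling and per-category hit-rate reporting described in the commit message above; the field names, the `calculate_hit_rates` signature, and the percentage scaling shown here are illustrative assumptions rather than the exact lmms_eval code:

```python
from collections import defaultdict

OPTIONS = ["A", "B", "C", "D", "E"]

def build_result_entry(doc, prediction):
    # Dynamically copy every option column, defaulting to "nan" when the
    # benchmark row does not define that option.
    entry = {opt: doc.get(opt, "nan") for opt in OPTIONS}
    entry.update({
        "question": doc["question"],
        "answer": doc["answer"],
        "prediction": prediction,
        "category": doc.get("category", "default_value"),
        "l2_category": doc.get("l2_category", "default_value"),
    })
    return entry

def calculate_hit_rates(results):
    # Per-category accuracy, reported as a percentage (multiplied by 100).
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["prediction"] == r["answer"])
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}
```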

    commit 8d3526c0869f0ad7747ff6bb02441140792b461c
    Author: cocoshe <1228759711@qq.com>
    Date:   Thu Mar 28 13:38:36 2024 +0800

        fix doc

    * feat: Add LlavaOneVision model to available models

    chore: Update sqlitedict dependency to version 2.1.0

    * Revert "Squashed commit of the following:"

    This reverts commit 11b00999df3c43cb225482e030b791b2d454124c.

    * Refactor available models in lmms_eval

    Remove duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary in lmms_eval/models/__init__.py.

    * fix: Handle import errors in lmms_eval models/__init__.py

    The code changes in this commit fix the handling of import errors in the lmms_eval/models/__init__.py file. Previously, when an import error occurred, the code simply ignored it. This commit updates the code to log an error message using the logger module when an import error occurs.

    This commit also removes duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary.

    Recent user commits:
    - Refactor available models in lmms_eval
    - Revert "Squashed commit of the following:"
    - feat: Add LlavaOneVision model to available models
    - chore: Update sqlitedict dependency to version 2.1.0

    * fix: Handle import errors in lmms_eval models/__init__.py

    * chore: Remove unused imports in lmms_eval/models/__init__.py and lmms_eval/tasks/vcr_wiki/utils.py

    * Remove unused imports in lmms_eval/tasks/vcr_wiki/utils.py

    * chore: Update lmms_eval/tasks/vcr_wiki/utils.py

    This commit updates the `lmms_eval/tasks/vcr_wiki/utils.py` file. It removes unused imports and fixes the condition for loading Spacy models based on the `load_package` value in the config file. Additionally, it adds a debug log message when the Spacy models are not loaded due to `load_package` being set to False.

    Remove unused imports in `lmms_eval/tasks/vcr_wiki/utils.py`
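
A small sketch, assuming a `config` dict and loguru logging, of the conditional Spacy loading described above; the config layout and Spacy model name are assumptions, not the exact vcr_wiki code:

```python
import spacy
from loguru import logger

# Stand-in for the task config loaded elsewhere; only the `load_package`
# key follows the commit message above.
config = {"metadata": {"load_package": False}}

nlp = None
if config["metadata"].get("load_package", True):
    # Only pay the cost of loading the Spacy pipeline when requested.
    nlp = spacy.load("en_core_web_sm")
else:
    logger.debug("load_package is False; skipping Spacy model loading for vcr_wiki")
```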

    * feat: Add new subtasks to overall score calculation

    The code changes in this commit add new subtasks to the overall score calculation in the `overall_score` function. The subtasks "ScanQA", "BLINK", "MathVerse", "SciVerse", and "Mantis" are included in the `categories` dictionary. This ensures that the scores for these subtasks are calculated and included in the evaluation results.

    Remove unused imports and update subtask categories in `utils.py`
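
A hedged sketch of how the new entries could be added to a `categories` dictionary inside `overall_score`; the grouping and simple averaging below are illustrative assumptions:

```python
def overall_score(results):
    # `results` is assumed to map subtask name -> score.
    # The keys mirror the subtasks named in the commit; how they are grouped
    # and weighted in the real overall_score function may differ.
    categories = {
        "ScanQA": ["ScanQA"],
        "BLINK": ["BLINK"],
        "MathVerse": ["MathVerse"],
        "SciVerse": ["SciVerse"],
        "Mantis": ["Mantis"],
    }
    scores = {}
    for category, subtasks in categories.items():
        per_task = [results[t] for t in subtasks if t in results]
        if per_task:
            scores[category] = sum(per_task) / len(per_task)
    scores["overall"] = sum(scores.values()) / len(scores) if scores else 0.0
    return scores
```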

    * feat: Add new subtasks to overall score calculation

    * chore: Update lmms_eval/tasks/llava_interleave_bench/_default_template_interleave_yaml

    Update the image aspect ratio in the default template for the llava_interleave_bench task. Change the value of "image_aspect_ratio" from "original" to "pad". This ensures that the generated images have a padded aspect ratio.

    * if no response directly return 0

    * Squashed commit of the following:

    commit b2a009b6bbf8353172f5a1dd9c29ea1f67610c02
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Mon Jul 15 19:12:25 2024 -0700

        if no response directly return 0 (#142)

    commit 5fc5f2f5acf454fc99448b0d62eb52b4bffba0d5
    Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
    Date:   Tue Jul 16 10:12:11 2024 +0800

        Add Muirbench (#143)

        * handle gen kwargs in internvl2

        * Add muirbench

    * Add files via upload

    (cherry picked from commit 557083a156c3dd67ac79e22b4202e9b69b6b00f4)

    * update

    ---------

    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: Yan Shu <570533048@qq.com>

commit b2a009b6bbf8353172f5a1dd9c29ea1f67610c02
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Mon Jul 15 19:12:25 2024 -0700

    if no response directly return 0 (#142)

commit 5fc5f2f5acf454fc99448b0d62eb52b4bffba0d5
Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
Date:   Tue Jul 16 10:12:11 2024 +0800

    Add Muirbench (#143)

    * handle gen kwargs in internvl2

    * Add muirbench

commit 4f8db1d37b1f824432927e74d6d82e06bb5aaed1
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Fri Jul 12 17:26:50 2024 -0700

    Upload live_bench results (#140)

    * upload results

    * add a readme

    * chore: Update upload_results.py script to use shell syntax

    * Update upload_results.py

    * Update upload_results.py

commit 18f3812c4f9af2e49af6b50e8afe7f607b8a75d6
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Jul 10 18:13:43 2024 -0700

    Load tasks only one time (#139)

    * chore: Initialize tasks only once to avoid re-initialization

    * chore: Initialize tasks only once to avoid re-initialization

    * chore: Refactor task initialization to avoid re-initialization

    * chore: Update task initialization to fix include_path issue

    * chore: Update task initialization to fix include_path issue

* chore: Remove unnecessary line in muirbench.yaml

* chore: Remove unnecessary line in muirbench.yaml and update gitignore

* chore: Update lmms_eval to use correct variable name for world size

* Update mmvet

* chore: Update lmms_eval to use correct variable name for world size

* chore: Remove unused lmms_eval configuration file

* refactor: Update lmms_eval to handle both image and video tasks

This commit updates the `Llava_OneVision` class in `llava_onevision.py` to handle both image and video tasks. It introduces conditional logic to differentiate between the two types of tasks and process the input accordingly. Additionally, it sets the image aspect ratio based on the number of visual inputs and the configuration settings.

Closes #123
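
A simplified sketch of the kind of branching described above; the extension check and config keys are assumptions for illustration, not the actual Llava_OneVision implementation:

```python
def select_image_aspect_ratio(visuals, config):
    # Treat string paths ending in a video extension as a video task; in the
    # real model the task type is determined by the request, not the file name.
    is_video = any(isinstance(v, str) and v.endswith((".mp4", ".avi", ".mkv")) for v in visuals)
    if is_video:
        # Video task: frames are sampled elsewhere; keep the original geometry.
        return "original"
    if len(visuals) > 1:
        # Multi-image task: pad so all images share one shape.
        return config.get("multi_image_aspect_ratio", "pad")
    # Single image: fall back to whatever the model config requests.
    return config.get("image_aspect_ratio", "anyres")

print(select_image_aspect_ratio(["clip.mp4"], {}))        # original
print(select_image_aspect_ratio(["a.jpg", "b.jpg"], {}))  # pad
```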

* Fix llava onevision loglikelihood video bug

(cherry picked from commit f96e3e69fe86dcd9cb33d2bc18cc4ff2003de8be)

* refactor: Update mm_spatial_pool_mode to use bilinear interpolation

This commit updates the `mm_spatial_pool_mode` parameter in the `Llava_OneVision` class of `llava_onevision.py` to use bilinear interpolation instead of the previous average pooling mode. This change improves the spatial pooling process for the model.

Closes #456
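
A minimal sketch contrasting average pooling with bilinear interpolation for spatial pooling of visual features; the function name and stride handling are illustrative, not the model's actual code:

```python
import torch
import torch.nn.functional as F

def spatial_pool(features, stride=2, mode="bilinear"):
    # features: (batch, channels, height, width) visual feature map.
    _, _, h, w = features.shape
    if mode == "average":
        # Previous behaviour: average pooling over stride x stride windows.
        return F.avg_pool2d(features, kernel_size=stride)
    # New behaviour per the commit above: resize the feature map with
    # bilinear interpolation to the pooled resolution.
    return F.interpolate(features, size=(h // stride, w // stride), mode="bilinear", align_corners=False)

pooled = spatial_pool(torch.randn(1, 16, 24, 24), stride=2)
print(pooled.shape)  # torch.Size([1, 16, 12, 12])
```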

* chore: Update pyproject.toml with protobuf dependency version 3.20

* Squashed commit of the following:

commit e106f49ceeb295fd4c89a0877073bc01b4b77c5f
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jul 25 08:14:03 2024 +0800

    livebench_july

commit a16295653fdda20d5e8c41c549d731ec422013e3
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Mon Jul 22 15:09:58 2024 +0800

    websites

commit 2cdc06ffe6ba53a4c707c1acf9fc5f2e7886b2b8
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sun Jul 21 15:34:39 2024 +0800

    everything use gpt-4o

commit e67538d65526c58903d9e62d1914ebd39924ab67
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sun Jul 21 14:29:55 2024 +0800

    chore: Update dataset capture settings in create_dataset.py

commit 0a3bb33d37cda05bb7bfba4ecf873c2860092a03
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sun Jul 21 01:58:14 2024 +0800

    gpt-4-turbo => gpt-4o

commit 837f8b0400f04f4367f8f8f954afd64666d62fc6
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 16:48:04 2024 +0800

    chore: Update dataset name and version for live_bench task

commit fa58e730978b5536005c8bd0291abbeddd761205
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 15:05:13 2024 +0800

    generate data

commit faa96227a7af7bd6546578b2db68dce2acbc2c0c
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 13:15:18 2024 +0800

    fix

commit 60ea7ddb4fcd9f08013cd0d5b9dd8090f7e6b83e
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 13:12:31 2024 +0800

    fix bugs

commit 827d69d0bf967f5d69bfbee9848b4d568ca853b1
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 08:39:41 2024 +0800

    use claude to generate

commit b7e2619d1a51144cd434861ac151187aed82c8c4
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 07:36:59 2024 +0800

    extract information

commit f87d55d47cb0d6653765e9e3f988f4bc186f7d4c
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 07:24:07 2024 +0800

    claude auto detect json mode

commit dfdba507b5fbe985b0030ffec575f9f2638bc1ed
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 16 11:13:52 2024 +0800

    merge ov evals (#144)

    * chore: Update gpt_eval_model_name to "gpt-3.5-turbo" in mathvista.yaml

    * Squashed commit of the following:

    commit 994c9f97a2f8db3e9b7d7933d1e1680acde5b70b
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 17:21:23 2024 +0800

        Add files via upload

    * Squashed commit of the following:

    commit e31cd7883d4555c7530795c7f102b8d78cbd372f
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jul 10 12:08:08 2024 +1000

        chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

    commit 1d8c980d1089f9d7702c3b92d5c85039f2809c6d
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jul 9 02:08:52 2024 +0000

        Rename xcomposer 4KHD

    commit 6da76f36ecf5f9aa73057e767a4fcb60c99ff896
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:55:56 2024 +1000

        Upgrade lmms-eval to version 0.2.1

    commit cd1858523fcd8630082cbefba8710e0de3ee8805
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:52:23 2024 +1000

        Upgrade lmms-eval to support more models and evaluation tasks

    commit 672d7e5bb49dcb34e1b2fdeb09f3f4588dc583a6
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:43:41 2024 +1000

        feat: Add tie_weights parameter to Llava model initialization

    commit 2037a86261b55fa42b8ba3a04eab192b3e69d6ea
    Merge: e6844db1 a5c18692
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:37:12 2024 +1000

        Fix gen kwargs image aspect ratio in internvl2

    commit a5c186925de989b616f58a35ece36065a32b4594
    Merge: 2ebec77f 557083a1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 9 09:15:56 2024 +0800

        Merge pull request #137 from shuyansy/main

        add MLVU task

    commit 557083a156c3dd67ac79e22b4202e9b69b6b00f4
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 16:56:50 2024 +0800

        Add files via upload

    commit 2ebec77f5606d79e9a7b995970e32792050606a1
    Merge: 211bfede b23d349e
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 8 11:53:06 2024 +0800

        Merge pull request #136 from Dousia/main

        Add detailcaps

    commit b23d349e46d60dc149ffaa54d6e019f4996ed92d
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:24:19 2024 +0800

        Add install capture_metric in env

    commit c6e211d5f9dbb7572d3a141b6504cb1ca2007c33
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:04:13 2024 +0800

        Add detailcaps

    commit 211bfedebad243ef82a8b0be36c3b5a9b9cb2f72
    Merge: 7c208b76 79514eee
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 2 23:05:12 2024 +0800

        Merge pull request #133 from EvolvingLMMs-Lab/dev/wild_vision

        Add wild vision bench

    commit 79514eeebcfd6f655be2a10c776037d12a7b7214
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 15:10:02 2024 +0000

        Fixing handling None filtered score

    commit 725fac2781446958b905e1e6c6eb3c0a8e582e49
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:25:42 2024 +0000

        Fixing dataset name

    commit 8d963e132ac03fc0d835d480cfcfcabe72af143c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:24:51 2024 +0000

        Fixing scoring logic

    commit e2990d0a69e876721256fdf946c68ba7ae0cbdc1
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:57 2024 +0000

        Hardcode to keep image for wild vision

    commit ed381736730d8fb785b4ee919fdb751734ecef25
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:38 2024 +0000

        Add wild vision 0617

    commit 7c208b76640c986cfe94233dce735c3ca4ad4319
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:53:31 2024 +0800

        Update README.md

    commit 39d40dea47bc59ff04e8b0cbc445345098debc9a
    Merge: e19b43a3 ba7081c0
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:47:09 2024 +0800

        Merge pull request #129 from Dannoopsy/mmbench_ru

        add task MMBench-ru

    commit e19b43a3a1e7212e623061b164b0419cc0dda689
    Merge: 11fd7e3f a0de8970
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:58 2024 +0800

        Merge pull request #128 from Dannoopsy/gqa-ru

        add task gqa-ru

    commit 11fd7e3fc05908aeb01e4a6161a7b55cd38b3122
    Merge: 383e7fea a7522592
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:16 2024 +0800

        Merge pull request #130 from lscpku/vitatecs

        Add task VITATECS

    commit a75225926e5954f85466d257f99acf0163fde596
    Author: lscpku <lisc99@pku.edu.cn>
    Date:   Fri Jun 28 20:37:06 2024 +0800

        create new task vitatecs

    commit ba7081c0abac840002d320e30733e891298dfa11
    Author: Dannoopsy <63581325+Dannoopsy@users.noreply.github.com>
    Date:   Fri Jun 28 12:21:05 2024 +0300

        change prompt to ru

    commit 27ea9c0055a8abf3a8198829b8617018479918e2
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Thu Jun 27 17:17:29 2024 +0000

        add mmbench_ru_dev

    commit 383e7fead3138aedf62e9c0ec48303835ef26e2a
    Merge: 06fa000f ed2e7f79
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri Jun 28 00:14:10 2024 +0800

        Merge pull request #126 from lorenzomammana/feature/external-package-integration

        External package integration using plugins

    commit ed2e7f792151d21bce8f1c498270b9391e1d5c85
    Merge: 03947e14 06fa000f
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Thu Jun 27 15:38:10 2024 +0000

        Merge branch 'main' into feature/external-package-integration

    commit a0de89708d5e6f259bb17f0eaace3c5b901b275c
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Tue Jun 25 11:11:37 2024 +0000

        new task gqa-ru

    commit 06fa000f60d3e4d160fac8ceb9959ae92a98f752
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jun 25 06:41:13 2024 +0000

        Fix vid mme post prompt issue

    commit b388d79e0df6f60068196cb7047453ebd22d6ef1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 22:31:16 2024 +0800

        Update activitynetqa_generation.yaml

    commit 8f9d620fcd9d0a0742ee6bcf51ea63bd6b088a36
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:25 2024 +0800

        Update pyproject.toml

    commit 6341b7c15ce9fb28eb06b067ddb299d6cf2e16c3
    Merge: fce85f1b 903b042b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:02 2024 +0800

        Merge pull request #125 from EvolvingLMMs-Lab/dev/interleave

        [Model] aligned llava-interleave model results on video tasks

    commit 903b042be016016d4ebeecb07701f3076a2d323c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 12:07:13 2024 +0000

        Remove unnecessary lines for video llava

    commit d78ec86407b729a964906a8c2e50704b4bc74d06
    Merge: ebe7217a fce85f1b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Jun 22 13:57:31 2024 +0800

        Merge branch 'main' into dev/interleave

    commit ebe7217a486c1e754e42c2cbdb834e09fbbcc9b0
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 02:57:08 2024 +0000

        Delete unnecessary lines

    commit 120c474b056f9177c74e1fd9691d59e2f234b785
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:38:41 2024 +0000

        Revise model registry for llava_hf and longva

    commit 7d6201f921088afd3f52a35076e3c6fcc9aa518c
   …
kcz358 added a commit that referenced this pull request Sep 1, 2024
…s. (#218)

* Load tasks only one time (#139)

* chore: Initialize tasks only once to avoid re-initialization

* chore: Initialize tasks only once to avoid re-initialization

* chore: Refactor task initialization to avoid re-initialization

* chore: Update task initialization to fix include_path issue

* chore: Update task initialization to fix include_path issue

* Upload live_bench results (#140)

* upload results

* add a readme

* chore: Update upload_results.py script to use shell syntax

* Update upload_results.py

* Update upload_results.py

* Add Muirbench (#143)

* handle gen kwargs in internvl2

* Add muirbench

* if no response directly return 0 (#142)

* merge ov evals (#144)

* chore: Update gpt_eval_model_name to "gpt-3.5-turbo" in mathvista.yaml

* Squashed commit of the following:

commit 994c9f97a2f8db3e9b7d7933d1e1680acde5b70b
Author: Yan Shu <570533048@qq.com>
Date:   Mon Jul 8 17:21:23 2024 +0800

    Add files via upload

* Squashed commit of the following:

commit e31cd7883d4555c7530795c7f102b8d78cbd372f
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jul 10 12:08:08 2024 +1000

    chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

commit 1d8c980d1089f9d7702c3b92d5c85039f2809c6d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue Jul 9 02:08:52 2024 +0000

    Rename xcomposer 4KHD

commit 6da76f36ecf5f9aa73057e767a4fcb60c99ff896
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:55:56 2024 +1000

    Upgrade lmms-eval to version 0.2.1

commit cd1858523fcd8630082cbefba8710e0de3ee8805
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:52:23 2024 +1000

    Upgrade lmms-eval to support more models and evaluation tasks

commit 672d7e5bb49dcb34e1b2fdeb09f3f4588dc583a6
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:43:41 2024 +1000

    feat: Add tie_weights parameter to Llava model initialization

commit 2037a86261b55fa42b8ba3a04eab192b3e69d6ea
Merge: e6844db1 a5c18692
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:37:12 2024 +1000

    Fix gen kwargs image aspect ratio in internvl2

commit a5c186925de989b616f58a35ece36065a32b4594
Merge: 2ebec77f 557083a1
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 9 09:15:56 2024 +0800

    Merge pull request #137 from shuyansy/main

    add MLVU task

commit 557083a156c3dd67ac79e22b4202e9b69b6b00f4
Author: Yan Shu <570533048@qq.com>
Date:   Mon Jul 8 16:56:50 2024 +0800

    Add files via upload

commit 2ebec77f5606d79e9a7b995970e32792050606a1
Merge: 211bfede b23d349e
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 8 11:53:06 2024 +0800

    Merge pull request #136 from Dousia/main

    Add detailcaps

commit b23d349e46d60dc149ffaa54d6e019f4996ed92d
Author: ByteDance <bytedance@MacBook-Pro.local>
Date:   Sun Jul 7 23:24:19 2024 +0800

    Add install capture_metric in env

commit c6e211d5f9dbb7572d3a141b6504cb1ca2007c33
Author: ByteDance <bytedance@MacBook-Pro.local>
Date:   Sun Jul 7 23:04:13 2024 +0800

    Add detailcaps

commit 211bfedebad243ef82a8b0be36c3b5a9b9cb2f72
Merge: 7c208b76 79514eee
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 2 23:05:12 2024 +0800

    Merge pull request #133 from EvolvingLMMs-Lab/dev/wild_vision

    Add wild vision bench

commit 79514eeebcfd6f655be2a10c776037d12a7b7214
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 15:10:02 2024 +0000

    Fixing handling None filtered score

commit 725fac2781446958b905e1e6c6eb3c0a8e582e49
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 08:25:42 2024 +0000

    Fixing dataset name

commit 8d963e132ac03fc0d835d480cfcfcabe72af143c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 08:24:51 2024 +0000

    Fixing scoring logic

commit e2990d0a69e876721256fdf946c68ba7ae0cbdc1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 06:06:57 2024 +0000

    Hardcode to keep image for wild vision

commit ed381736730d8fb785b4ee919fdb751734ecef25
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 06:06:38 2024 +0000

    Add wild vision 0617

commit 7c208b76640c986cfe94233dce735c3ca4ad4319
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:53:31 2024 +0800

    Update README.md

commit 39d40dea47bc59ff04e8b0cbc445345098debc9a
Merge: e19b43a3 ba7081c0
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:47:09 2024 +0800

    Merge pull request #129 from Dannoopsy/mmbench_ru

    add task MMBench-ru

commit e19b43a3a1e7212e623061b164b0419cc0dda689
Merge: 11fd7e3f a0de8970
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:46:58 2024 +0800

    Merge pull request #128 from Dannoopsy/gqa-ru

    add task gqa-ru

commit 11fd7e3fc05908aeb01e4a6161a7b55cd38b3122
Merge: 383e7fea a7522592
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:46:16 2024 +0800

    Merge pull request #130 from lscpku/vitatecs

    Add task VITATECS

commit a75225926e5954f85466d257f99acf0163fde596
Author: lscpku <lisc99@pku.edu.cn>
Date:   Fri Jun 28 20:37:06 2024 +0800

    create new task vitatecs

commit ba7081c0abac840002d320e30733e891298dfa11
Author: Dannoopsy <63581325+Dannoopsy@users.noreply.github.com>
Date:   Fri Jun 28 12:21:05 2024 +0300

    change prompt to ru

commit 27ea9c0055a8abf3a8198829b8617018479918e2
Author: Dannoopsy <belopolskikh.dd@phystech.edu>
Date:   Thu Jun 27 17:17:29 2024 +0000

    add mmbench_ru_dev

commit 383e7fead3138aedf62e9c0ec48303835ef26e2a
Merge: 06fa000f ed2e7f79
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Jun 28 00:14:10 2024 +0800

    Merge pull request #126 from lorenzomammana/feature/external-package-integration

    External package integration using plugins

commit ed2e7f792151d21bce8f1c498270b9391e1d5c85
Merge: 03947e14 06fa000f
Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
Date:   Thu Jun 27 15:38:10 2024 +0000

    Merge branch 'main' into feature/external-package-integration

commit a0de89708d5e6f259bb17f0eaace3c5b901b275c
Author: Dannoopsy <belopolskikh.dd@phystech.edu>
Date:   Tue Jun 25 11:11:37 2024 +0000

    new task gqa-ru

commit 06fa000f60d3e4d160fac8ceb9959ae92a98f752
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue Jun 25 06:41:13 2024 +0000

    Fix vid mme post prompt issue

commit b388d79e0df6f60068196cb7047453ebd22d6ef1
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 22:31:16 2024 +0800

    Update activitynetqa_generation.yaml

commit 8f9d620fcd9d0a0742ee6bcf51ea63bd6b088a36
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 14:02:25 2024 +0800

    Update pyproject.toml

commit 6341b7c15ce9fb28eb06b067ddb299d6cf2e16c3
Merge: fce85f1b 903b042b
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 14:02:02 2024 +0800

    Merge pull request #125 from EvolvingLMMs-Lab/dev/interleave

    [Model] aligned llava-interleave model results on video tasks

commit 903b042be016016d4ebeecb07701f3076a2d323c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Jun 22 12:07:13 2024 +0000

    Remove unnecessary lines for video llava

commit d78ec86407b729a964906a8c2e50704b4bc74d06
Merge: ebe7217a fce85f1b
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 22 13:57:31 2024 +0800

    Merge branch 'main' into dev/interleave

commit ebe7217a486c1e754e42c2cbdb834e09fbbcc9b0
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Jun 22 02:57:08 2024 +0000

    Delete unnecessary lines

commit 120c474b056f9177c74e1fd9691d59e2f234b785
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:38:41 2024 +0000

    Revise model registry for llava_hf and longva

commit 7d6201f921088afd3f52a35076e3c6fcc9aa518c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:38:24 2024 +0000

    Add longva

commit 12f480699c71a12a24d4349d9b0681933201a3a6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:35:39 2024 +0000

    Remove unnecessary lines since use batched visuals now in llava

commit 12cea76f1f0f14b1fd1007c9d39a9b0557368637
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 18:15:32 2024 +0000

    chore: Add loguru for logging in lmms_eval package

commit 03947e14a46fd25b412931f7c9c25f4a2971d0b4
Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
Date:   Wed Jun 5 13:40:41 2024 +0000

    feat: Allow including external tasks from plugins

commit b80a91f73e15ddd0b0ce1322d7d121fa14030eed
Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
Date:   Wed Jun 5 13:04:55 2024 +0000

    feat: Allow loading model configurations from other packages

commit 8ef24740dd48a11c97eb627f2fff4aca107fef0d
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 12:11:03 2024 +0000

    chore: Remove unused models from lmms_eval package

commit af38885fc2e066f5ea44388f33e07176f836fe28
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 12:07:09 2024 +0000

    chore: Handle ImportError when importing models

    Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.
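A minimal, hedged sketch of the import-guard pattern this commit message describes (illustrative only: the registry entries and module path below are assumptions, not the repository's exact code):

```python
# Illustrative sketch, not the actual lmms_eval/models/__init__.py:
# wrap each model import in try/except so one missing dependency
# does not break the whole registry.
import importlib

AVAILABLE_MODELS = {  # hypothetical entries for illustration
    "llava": "Llava",
    "qwen_vl": "QwenVL",
}

for module_name, class_name in AVAILABLE_MODELS.items():
    try:
        importlib.import_module(f"lmms_eval.models.{module_name}")
    except ImportError as e:
        # the commit describes printing an error naming the failed import
        print(f"Failed to import {module_name}: {e}")
```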

commit fce85f1b03ff7043b29dee787c5d17a08dd2687a
Merge: dbe63293 d94f83cb
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 20:02:12 2024 +0800

    Merge pull request #120 from EvolvingLMMs-Lab/pufanyi/hf_dataset_docs

    Add docs for datasets upload to HF

commit dbe63293245a5141fdfd80bda7657c304f6bd32f
Author: choiszt <ls2001927@sohu.com>
Date:   Thu Jun 20 15:14:21 2024 +0800

    update ablation for videomme datasets

commit d94f83cb3f08b61a2c75cc4326e58792100605b3
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 13:30:59 2024 +0800

    Update README.md

commit cab8159ff35db330536c0b6dfb4b0a3b24142209
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 13:30:29 2024 +0800

    Update README.md

commit 45876652a877a8006b828f32f5cc4660629f9190
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu Jun 20 03:55:30 2024 +0000

    Add llava_hf back to registry

commit 3463651b8c54d36cd94169e3d376f5ed225a195a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu Jun 20 03:54:33 2024 +0000

    Remove handling non-visual loop in llava

commit cb0d3f49b72790b081f981e0e6147131542f7f68
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jun 20 02:11:18 2024 +0800

    update readme

commit 813877bfe5ac590cdbe92dd74d18f83a2091f748
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:37:52 2024 +0800

    to sh script

commit a14684b8557d5894976448a5c559ed7a66a6cf16
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:37:04 2024 +0800

    lint

commit d0f8851d42ba31f5da2a7a65e91499db45174dbc
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:36:48 2024 +0800

    small fix

commit 63748e9718f287ad433afc90e340b5e17a89c1ed
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:36:43 2024 +0800

    small fix

commit 7f1159a1fe04cfb783dc31d4fbdef3bda0ce19e4
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:35:05 2024 +0800

    update preparation

commit 19f9bd621c76a483ff98f8c7eb78f64753da683a
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:23:24 2024 +0800

    docs

commit ce6f889ba02d819979c7922f6336cf4f1f718f65
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:04:16 2024 +0800

    tutorial

commit f513c520c2a3dad26d2b2ca5c4ed4db05a493c73
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 19 06:51:19 2024 +0000

    chore: Update dependencies to fix potential risks and improve compatibility

commit efb529552c5e4ba039a4cba8e9aa5cb7ba65bf90
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Wed Jun 19 10:25:58 2024 +0800

    Release llava-wilder

commit 742651fc9daf97e2f57831ed6e6e7ee7ead7d555
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 07:44:26 2024 +0800

    feat: Add support for auto downloading tar format videos

commit 511b6259828212fcba954cdeb8cf90d6e5daabf8
Merge: 22a4958e 050b2c37
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jun 18 17:01:03 2024 +0000

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

commit 050b2c370017e9b97475dd6cf01fd051b5ca5c86
Merge: 74facb41 ef306512
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 18 13:13:38 2024 +0800

    Merge pull request #114 from zjysteven/add-tinyllava

    add tinyllava

commit ef306512e5135f76dffa383f600b8733015836e8
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Mon Jun 17 17:57:02 2024 -0400

    fix typo

commit 9bab67732a4238097725deddf867fb1946ffee40
Merge: dbfb2387 74facb41
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Sun Jun 16 10:56:05 2024 -0400

    Merge branch 'EvolvingLMMs-Lab:main' into add-tinyllava

commit 74facb41a826691dfce4458cf1d8659b34fc5bf5
Merge: 8ba192f9 d5df72de
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 16 17:59:19 2024 +0800

    Merge pull request #118 from teowu/main

    Fix the potential risk by PR #117

commit d5df72de2d03108d6b365818ecc3551ac9aa6302
Merge: 5bf59ed2 8ba192f9
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sun Jun 16 15:32:13 2024 +0800

    Merge branch 'EvolvingLMMs-Lab:main' into main

commit 5bf59ed250da98a408a94e214a73caa400cba842
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:27:28 2024 +0000

    fix #117, allow auto download with tar format videos

commit 98b3955cb808e36303c030aea78eb037d1ec59ce
Merge: a056f118 be9dada8
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:25:07 2024 +0000

    Merge branch 'main' of https://github.com/teowu/lmms-eval into main

commit a056f118704eccec86ce32ab86981ce4bc1e1deb
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:23:54 2024 +0000

    fix #117, allow auto download with tar format videos

commit 8ba192f94edf5d99598983445d5faa4f8807c49f
Merge: 7cc28907 be9dada8
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 17:30:59 2024 +0800

    Merge pull request #117 from teowu/main

    LongVideoBench for LMMs-Eval

commit be9dada8b4189c53c08e1674ab273242cf2f80a0
Merge: 62ea8ceb 7cc28907
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sat Jun 15 16:39:20 2024 +0800

    Merge pull request #1 from EvolvingLMMs-Lab/main

    Merge pull request #113 from teowu/main

commit 62ea8ceb223ef2b51ebab2bcd50d5cf339c35cfe
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sat Jun 15 08:30:11 2024 +0000

    LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

commit 7cc28907edbb4eb58ee1398772a48110ea35dd96
Merge: 4bc7224d ea14cd4b
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 14:10:22 2024 +0800

    Merge pull request #113 from teowu/main

    Q-Bench, Q-Bench2, A-Bench

commit dbfb23873979f789477f4797ee2d6071e0fd921e
Author: Jingyang <jingyang.zhang@duke.edu>
Date:   Fri Jun 14 16:20:42 2024 -0400

    add tinyllava

commit ea14cd4b361f4c95b3665cbdb95bc51754090eb5
Author: teowu <realtimothyhwu@gmail.com>
Date:   Fri Jun 14 15:01:52 2024 +0000

    Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image

commit 4bc7224dcd27fe8b288bfc3fed4d7a9da9635658
Merge: 2797987f bf14cb85
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Jun 14 02:14:43 2024 +0800

    Merge pull request #111 from XinrunDu/main

    add II-Bench

commit bf14cb8527b2b7ac438a36567a875168bc02d294
Author: XinrunDu <duxinrun2000@gmail.com>
Date:   Thu Jun 13 09:37:02 2024 +0000

    fix dataset_path

commit 6248113f4e11a0ac396d31fa1b032a142fea8cb4
Author: XinrunDu <duxinrun2000@gmail.com>
Date:   Thu Jun 13 09:32:06 2024 +0000

    add II-Bench

commit 2797987f5b88b87bd172714b678a75a1d8051826
Merge: 63d82f1f 66d4bb2d
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 11:14:47 2024 +0800

    Merge pull request #109 from EvolvingLMMs-Lab/pufanyi/update_version

    [Small Update] Update the version of LMMs-Eval

commit 66d4bb2d9c9afbbdea40196d4ad80e214d0b14b6
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jun 13 11:13:00 2024 +0800

    update version

commit 63d82f1ff11eb430d91a15d6788a1f0b4d596850
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 11:04:32 2024 +0800

    Update README.md

commit 44a33799671cb668f55366d5e5a4ddb051a3a1b4
Merge: 5ed00356 0ce46d08
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 04:00:12 2024 +0800

    Merge pull request #105 from tianyu-z/main

    Include VCR

commit 0ce46d088e473d12d63de44f17c67dceab25658c
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:56:34 2024 -0400

    update README.md

commit 46a88d8b0199ed44d2ff459fb372f2e006960cea
Merge: 47b13b9b 5ed00356
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:50:26 2024 -0400

    merged readme.md

commit 47b13b9b320d36ac53b3622557e31239f7c22621
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:30:52 2024 -0400

    update aggregation function for vcr_wiki

commit 5ed00356676cf5d0ff056cf27d1b519b8e303ff7
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:21:42 2024 +0800

    Update README.md

commit ed8806839db5988ced672bd162b7b046edb4863a
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:13:59 2024 +0800

    Update README.md

commit fea3806026932a6e2bd6e538bcc413e33abdf245
Merge: d99a24ab 05dc8e85
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:11:49 2024 +0800

    Merge pull request #108 from EvolvingLMMs-Lab/internal_main_dev

    [Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval

commit 05dc8e853eab7c6bc782a1e2662d2efe7422f767
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:56:04 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit cbeee20bc4ffb510a2b23d96cdaf4077be7c2a9e
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:50:30 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit f00d5498b69dd4f7e54c907ac906abc7c128f000
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:46:33 2024 +0000

    Update image alignment in README.md

commit 34156335db74cef9e3f0915d7172fd6b22456c15
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:43:16 2024 +0000

    Update llava conv_template in lmms_eval/models/llava.py

commit 50575a950736bc8fc1e191310314cbb5fdff5720
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:39:03 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit c9b2252fb8a15dd04252af5e6b4613855afd6ada
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:33:48 2024 +0000

    Bump version to 0.2.0.dev0

commit 465bd4205e8097e9c037b24a3ed08dd6a7694efa
Merge: e43bd840 d99a24ab
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:04:25 2024 +0000

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval into internal_main_dev

commit e43bd840b63eb499856e36d9d2ba45c924abcead
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 14:54:06 2024 +0000

    chore: Remove unnecessary files and code related to live_bench and sft_eval tasks

commit d99a24abd06df10d07e5a4d0ad5030613f92f2e7
Merge: 374590be a66003be
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jun 12 19:45:57 2024 +0800

    Merge pull request #107 from AtsuMiyai/new_task/upd_update

    update gpt-3.5-turbo version

commit a66003befe4175824a1be6ed59f5f5b88c15f792
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed Jun 12 17:05:17 2024 +0900

    update gpt-3.5-turbo version

commit ee91f272985f32eeb9cd6faa41afdd8eb49cac30
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed Jun 12 16:50:53 2024 +0900

    update gpt-3.5-turbo version

commit 326b9694fc77398592b8caf3ba0bc2e2bb903813
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 20:07:40 2024 -0400

    include std and confidence interval

commit cd050d4a721d01a2ace0cd030cf7f8dc67eb8c4d
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 18:49:47 2024 -0400

    update vcr_wiki tasks in README.md

commit 205721e0aad76dde30255e56149bbed121883356
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 18:43:15 2024 -0400

    update vcr_wiki tasks

commit db8e718b502469e8536ee359c5559de87635ffc7
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 16:13:58 2024 -0400

    include the try-except logic for spacy

commit 427dabb790118f538b64e4e5bf6a7aab9689b3d9
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 15:51:05 2024 -0400

    add crossed_text to vcr_wiki output

commit 043b483eb55f7be4fea75c9bc0b9b03d251b109b
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 15:47:00 2024 -0400

    switch logic

commit e1f04db8f58dd10591fde335ea13f74cda7c79bd
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 02:38:21 2024 -0400

    modify the form of VCR

commit 96e8d9867c9549ab7490f4b12cfeb6a06238e0aa
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 00:10:30 2024 -0400

    init include vcr

commit 374590be62f988a76cf6704cfe394cd8ae7d4cb6
Merge: 504685e2 cb3b9ce7
Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
Date:   Fri Jun 7 20:25:48 2024 +0800

    Merge pull request #101 from Gumpest/main

    Update conbench in README

commit 504685e20b17659b913cf46f3012c16bf429e09d
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 6 15:42:15 2024 +0800

    Update README.md

commit cb3b9ce71411da862ff01342a9122a3c656ffbd1
Merge: c9793b38 67b64ea4
Author: Yuan Zhang <56063339+Gumpest@users.noreply.github.com>
Date:   Thu Jun 6 11:22:24 2024 +0800

    Merge branch 'EvolvingLMMs-Lab:main' into main

commit c9793b3883714f254a700230b7bee781d6110e73
Author: Yuan Zhang <gump_well_done@163.com>
Date:   Thu Jun 6 11:21:05 2024 +0800

    update README

commit 67b64ea44a5a39d96c7a196a8a8345a7486bd912
Merge: 8ee7848a 5fd68451
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jun 5 23:12:58 2024 +0800

    Merge pull request #100 from Gumpest/main

    add Conbench

commit 5fd684515c55ef643726c1b6c720c7cbd2183ba1
Author: Yuan Zhang <gump_well_done@163.com>
Date:   Wed Jun 5 21:52:31 2024 +0800

    add conbench

commit 8ee7848aaa6383aa1f919c3f21199c81db3fff89
Merge: 747e1978 6fefaf7c
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 4 17:09:33 2024 +0800

    Merge pull request #95 from AtsuMiyai/new_task/upd

    add MM-UPD

commit 747e19782996065cdce7157ee8c5e15beb5b6c59
Merge: 4854a34d 05843072
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 4 17:09:04 2024 +0800

    Merge pull request #97 from CaraJ7/update

    Add MathVerse in README.md

commit 6fefaf7cea504e35583ee7217449da290295a7a4
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Tue Jun 4 17:36:39 2024 +0900

    update utils.py for leaderboard submission

commit 5f4fe360def1c48ea0cb1da6409d192784882308
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Sun Jun 2 23:28:27 2024 +0900

    slightly change query_prompt for the reproduction

commit 05843072d608b970bcada1cd0db65a3c80864060
Author: CaraJ7 <1350074492@qq.com>
Date:   Sun Jun 2 17:05:28 2024 +0800

    Add MathVerse in README.md

commit 0581ab3cfb362e2024988b46fbbb00324f1233c9
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Fri May 31 16:09:45 2024 +0900

    merge model_specific_prompt_kwargs and dataset_name into each task yaml

commit 4854a34d4d37efb5e201f2691ecdb054590cf20b
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Sat May 4 19:23:39 2024 +0800

    Group MMMU images into one image (#83)

    * update

    * update font

    * Add matplotlib.font_manager import in utils.py

    * Refactor font handling in add_order_label function in utils.py

    * group mmmu

    ---------

    Co-authored-by: Li Bo <drluodian@gmail.com>
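Below is a small, hedged illustration of the image-grouping idea referenced in the commit above ("group MMMU images into one image" with an `add_order_label` helper). It assumes Pillow and omits the real code's matplotlib font handling; the names and layout are illustrative, not the repository's implementation:

```python
# Illustrative sketch only: label each sub-image with its order and
# stitch the labeled images horizontally into a single canvas.
from PIL import Image, ImageDraw

def add_order_label(img, label):
    annotated = img.copy()
    ImageDraw.Draw(annotated).text((5, 5), label, fill="red")  # default font
    return annotated

def group_images(images):
    labeled = [add_order_label(im, f"({chr(65 + i)})") for i, im in enumerate(images)]
    width = sum(im.width for im in labeled)
    height = max(im.height for im in labeled)
    canvas = Image.new("RGB", (width, height), "white")
    x = 0
    for im in labeled:
        canvas.paste(im, (x, 0))
        x += im.width
    return canvas
```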

commit d224794c49520f4d28a31862cf977198cd6cbc5e
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 15:15:59 2024 +0900

    add upd

commit 453e7936424220f02b99517059ca71babfbe5f5a
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 15:03:30 2024 +0900

    add upd

commit 909edd6769ddcf8a546be4fdd129416687516878
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:52:21 2024 +0900

    add upd

commit 7c1ac9706cafc4801fa4da181d2f610b7838c7b8
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:50:32 2024 +0900

    add upd

commit 811301c5280ddd74986645086f026ab730c8848c
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:46:58 2024 +0900

    add upd

commit 71401bafd1d515f704f86ab4817a758542bc4672
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:41:21 2024 +0900

    add upd

commit 24dc435908d921e9f1a5706e3141b12e5d838d18
Author: Bo Li <drluodian@gmail.com>
Date:   Mon May 27 10:17:32 2024 +0000

    fix compatibility issue of older version llava

commit 616edf43731415b35f0f5e97748ed2e017a2891d
Author: Bo Li <drluodian@gmail.com>
Date:   Mon May 27 09:32:26 2024 +0000

    [Fix] import issues of multilingual llava and olympiadbench

commit 4c5a99e21a63fb0ee1c7d15546d18066e1d9894b
Merge: 45c05b2b b05c3e22
Author: Li Bo <drluodian@gmail.com>
Date:   Mon May 27 14:19:53 2024 +0800

    Merge pull request #87 from vfragoso/vifragos/phi3v

    Adding microsoft/Phi-3-vision-128k-instruct model.

commit b05c3e222fabd308dd7af4e04c1c6a0812962fe6
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 16:36:37 2024 +0000

    Adding documentation of Phi3v class.

commit c2008971308ce8168d57c24d00b725832f099244
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 16:25:02 2024 +0000

    Adding prompt arguments for Phi3v on MathVista-TestMini

commit 7f9fb6bcc6cd24a7b8011b8753d0ea98cc2451fd
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 13:24:16 2024 +0000

    Adding Phi3v model.

commit 45c05b2b2bece76e06849a52a0d034f9c0ac2367
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:47:36 2024 +0000

    Set printing info for llava_hf to debug level

commit 53f013ed8278776551ca992562253387cc9968d2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:41:39 2024 +0000

    Fix pope random name in pope full

commit 22520a95f13334b75eee0cf0387151067a6bf516
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:41:14 2024 +0000

    Add separated pope tasks by category

commit d1eefb1565014b47287ffa6b350229062f8f602f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 9 08:36:02 2024 +0000

    Update gitignore

commit b2b4dbd2dc13432c79208db35abf7f55c97f1790
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 20 07:45:11 2024 +0000

    Comment out Spice in caption task so that don't need to download stanford nlp model

commit 662f05ce4c62a46a83f819d3a5925a9bd20059b5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 20 03:13:13 2024 +0000

    Comment out parse result in xcomposer

commit 09329322916bfbb604d72ddaf50441a0947f8805
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 03:55:39 2024 +0000

    Fix instructblip qformer size mismatch and multi-images problem

commit 557a6a3b15e07e506bc05e2cc76ff6a2f8c93964
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 03:11:41 2024 +0000

    Remove redundant code in fuyu

commit 6aeb5504e74ed1980b53700d8e4d4dcf7d1b38fc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 01:45:24 2024 +0000

    Fix idefics2 llava in the wild bugs

commit aea80e6a71f716951353e1e5d68380243396b4d6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Wed May 15 11:07:35 2024 +0000

    Better task list_with_num

commit 3c12a080d66b9c38f615b961befca7c30f82fa39
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:35:52 2024 +0800

    Update LICENSE

commit 82317a635a4978b32e095a06cc295d0ae23661c2
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:29:09 2024 +0800

    Update LICENSE

commit a8bba1cdb51061a0d27bf9a98cca1505b5c58ea5
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:28:03 2024 +0800

    Create LICENSE

commit caa5893b5fd2c1d32c72b97f371ccd9a8d9ec3a0
Merge: c0944486 423b0060
Author: Li Bo <drluodian@gmail.com>
Date:   Mon May 13 11:45:26 2024 +0800

    Merge pull request #73 from EvolvingLMMs-Lab/kc/qwen_vl_api

    [Feat] Add qwen vl api

commit c09444860362a136f17641f8b2a1f91c2bbc3715
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat May 11 06:11:19 2024 +0000

    Fix llava_hf image tokens number issue

commit 64f07e497f53e5bcbe9e8fb5830cc7a1daaf7ff1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 9 02:04:10 2024 +0000

    Fix endless warning for llava_hf generation

commit 8aaa828108da8514dd9cd23a9d6d83a8b67f2d65
Author: Bo Li <drluodian@gmail.com>
Date:   Thu May 2 06:13:56 2024 +0000

    Add model_name parameter to Llava constructor

commit 7847dc4d8efe60605102414bb071b1da9851228e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue May 7 03:15:59 2024 +0000

    Parse result for llava_hf 1.6

commit 3e56b4f92db39a2ce92903b0c43a34f1d14d59ec
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue May 7 03:09:56 2024 +0000

    Fix llava_hf generation for 1.6

commit fa3ff92b07ea5aaa633a2039818c310744f84d07
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 6 08:32:57 2024 +0000

    Fix llava conv template for llama3

commit 423b00606aa77fd6b324c19e3d480b73ab852db6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun May 5 07:54:52 2024 +0000

    Add qwen vl api

commit b7fd7a9f7aa3c0e1e50374047dfffc46a7462b90
Merge: 986139a9 c5a130b6
Author: Li Bo <drluodian@gmail.com>
Date:   Sun May 5 13:19:48 2024 +0800

    Merge pull request #59 from EvolvingLMMs-Lab/add_idefics2

    add idefics2

commit 986139a9a31154679bdea029b09639f84712db27
Merge: b46239ca 8d3526c0
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:18:18 2024 +0800

    Merge pull request #36 from cocoshe/main

    [Fix] repr llava doc

commit b46239cabab7b545ec99d9eae6c851e531b18374
Merge: bc69a744 373265f2
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:17:34 2024 +0800

    Merge pull request #56 from gagan3012/main

    Multilingual LLava bench

commit bc69a744d2cffeb06eba62e843bcc7869e27613a
Merge: eef3aeb6 626e8a91
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:12:14 2024 +0800

    Merge pull request #70 from hunterheiden/hsh/new_task/WebSRC

    Bugfix: WebSRC should be token-level F1 NOT character-level

commit 626e8a91a4af2dd5dd774fc130cc2f4d74b2bc37
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu May 2 09:31:03 2024 -0400

    Bugfix: WebSRC should be token-level F1 NOT character-level

commit eef3aeb6ab589bb1d5045af5b5c1984a69402d19
Merge: c4e9dd9f 9bca4413
Author: Li Bo <drluodian@gmail.com>
Date:   Thu May 2 14:38:17 2024 +0800

    Merge pull request #69 from hunterheiden/hsh/new_task/WebSRC

    [New Task] WebSRC (multimodal Q&A on web screenshots)

commit 9bca441376325173128e5c50087f068e519c48da
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 11:07:29 2024 -0400

    Add code to enable compilation of submission for WebSRC test split

commit 7687495b1ed552eeba088cb9ad5aaf1170e7fff9
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:47:32 2024 -0400

    Draft and validate websrc eval on dev split

commit 4eebd3e5d7ab3b8c3116eea57318db72d2ce32bb
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:46:54 2024 -0400

    Update main README with new task names

commit 35fe80b67656114a8824eb59574089663bdc4c9a
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:46:20 2024 -0400

    Draft README for WebSRC

commit 955bd0635cc6c14a96ad869f1002e6dbefdc5071
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Tue Apr 30 10:16:21 2024 -0400

    Init webSRC

commit c4e9dd9f6e40e8586587c4a75987aa109a37f14b
Merge: d8a3a99f 319afccb
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Apr 26 14:37:22 2024 +0800

    Merge pull request #63 from hunterheiden/hsh/new_task/screenspot

    New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens

commit 319afccbe713ddf40a8a6fa28501e64c0ad34725
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu Apr 25 11:44:34 2024 -0400

    slight update

commit 2f3811ca1bbad6a441016b05fde09a571900fca8
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu Apr 25 11:41:04 2024 -0400

    Add README file specific to ScreenSpot

commit 28962cbe83631ec5d6481aaea4907a7c96fec848
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed Apr 24 11:52:33 2024 -0400

    Update README to reflect new tasks

commit e457cfb4f2d6869e8367d6d5b03ad25ee4acc363
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Tue Apr 23 18:33:16 2024 -0400

    Create ScreenSpot on clean branch

commit d8a3a99ff6142fe101fa3c188cc7f29593c44345
Merge: 3dcd0158 ed171293
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Apr 23 10:34:03 2024 +0800

    Merge pull request #61 from tupini07/patch-1

    Fix typo in Qwen-VL that was causing "reference before assignment"

commit ed171293d1e82075c5c6a847fc91ecbfd45cf89f
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:56:41 2024 -0600

    refactor query construction for clarity

commit cd874201c46f32a2903ddffae85f9db73e14adfd
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:54:29 2024 -0600

    convert contexts to list if necessary and remove unnecessary construction of `questions`

commit 85573674e90c8d505312ba18c5102e0051255078
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:47:33 2024 -0600

    Fix typo in qwen_vl that was causing "reference before assignment"

commit 3dcd01582b719555bcf8eb25d91cc5e42abd2c5f
Merge: 95df9fee 743673a1
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Apr 20 22:03:16 2024 +0800

    Merge pull request #60 from CaraJ7/main

    Add MathVerse

commit 743673a1419b6e729e18c96f148745cc739d4c71
Merge: c1a54721 95df9fee
Author: CaraJ7 <1350074492@qq.com>
Date:   Sat Apr 20 21:49:02 2024 +0800

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

commit c1a5472135c3b84061b64d997ab50dda0412ba4f
Author: CaraJ7 <1350074492@qq.com>
Date:   Sat Apr 20 21:45:34 2024 +0800

    Add MathVerse

commit 373265f24e7a89cbd49ab724a2e388cc0930be78
Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
Date:   Fri Apr 12 17:21:39 2024 -0700

    Add files via upload

commit d8530514a5ef9378d2adeaceb228b60ec25a6718
Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
Date:   Fri Apr 12 17:19:49 2024 -0700

    Create README.md

commit 22a4958e993463edff352ac033014f9a485706cc
Author: Bo Li <bo.li01@bytedance.com>
Date:   Thu Apr 4 17:12:43 2024 +0000

    [WIP] adding mmbench dev evaluation (#75)

    * WIP

    * Update GPT evaluation model name and sys prompt

    * 🛠️ Scale accuracy to percentage

    The accuracy value is now multiplied by 100 in the aggregation function to represent it as a percentage. Regarding the evaluation process, `math` module importation and refactoring reduce progress log verbosity by logging every 100 evaluations instead of 10. It prevents potential logging overflow. Handling of NaN values is added to ensure 'default_value' is set in case of missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintenance.

    Issue refs: #1427, #1533

    * Update GPT evaluation model name and API configuration

    * Refactor MMBench_Evaluator class to handle missing columns

    * Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations

    * Refactor MMBench-CN and MMBench-EN evaluation functions

    * 🔄 Refactor result processing and logging logic

    - Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
    - Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
    - In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
    - Commented-out code and verbose logging during evaluation, which may have interfered with performance, has been removed for a more efficient and less intrusive logging experience.

    This cleanup reduces redundancy in the codebase and improves evaluation performance.

    Refs #2045

    ---------

    Co-authored-by: Bo Li <bo.li01@bytedance.com>
    (cherry picked from commit a19278c2ea6ddcbca64d3cc7f4efec7fe5775121)

commit 8d3526c0869f0ad7747ff6bb02441140792b461c
Author: cocoshe <1228759711@qq.com>
Date:   Thu Mar 28 13:38:36 2024 +0800

    fix doc

* feat: Add LlavaOneVision model to available models

chore: Update sqlitedict dependency to version 2.1.0

* Revert "Squashed commit of the following:"

This reverts commit 11b00999df3c43cb225482e030b791b2d454124c.

* Refactor available models in lmms_eval

Remove duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary in lmms_eval/models/__init__.py.

* fix: Handle import errors in lmms_eval models/__init__.py

The code changes in this commit fix the handling of import errors in the lmms_eval/models/__init__.py file. Previously, when an import error occurred, the code simply ignored it. This commit updates the code to log an error message using the logger module when an import error occurs.

This commit also removes duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary.

Recent user commits:
- Refactor available models in lmms_eval
- Revert "Squashed commit of the following:"
- feat: Add LlavaOneVision model to available models
- chore: Update sqlitedict dependency to version 2.1.0

* fix: Handle import errors in lmms_eval models/__init__.py

* chore: Remove unused imports in lmms_eval/models/__init__.py and lmms_eval/tasks/vcr_wiki/utils.py

* Remove unused imports in lmms_eval/tasks/vcr_wiki/utils.py

* chore: Update lmms_eval/tasks/vcr_wiki/utils.py

This commit updates the `lmms_eval/tasks/vcr_wiki/utils.py` file. It removes unused imports and fixes the condition for loading Spacy models based on the `load_package` value in the config file. Additionally, it adds a debug log message when the Spacy models are not loaded due to `load_package` being set to False.

Remove unused imports in `lmms_eval/tasks/vcr_wiki/utils.py`

* feat: Add new subtasks to overall score calculation

The code changes in this commit add new subtasks to the overall score calculation in the `overall_score` function. The subtasks "ScanQA", "BLINK", "MathVerse", "SciVerse", and "Mantis" are included in the `categories` dictionary. This ensures that the scores for these subtasks are calculated and included in the evaluation results.

Remove unused imports and update subtask categories in `utils.py`

* feat: Add new subtasks to overall score calculation

* chore: Update lmms_eval/tasks/llava_interleave_bench/_default_template_interleave_yaml

Update the image aspect ratio in the default template for the llava_interleave_bench task. Change the value of "image_aspect_ratio" from "original" to "pad". This ensures that the generated images have a padded aspect ratio.

* if no response directly return 0

* Squashed commit of the following:

commit b2a009b6bbf8353172f5a1dd9c29ea1f67610c02
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Mon Jul 15 19:12:25 2024 -0700

    if no response directly return 0 (#142)

commit 5fc5f2f5acf454fc99448b0d62eb52b4bffba0d5
Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
Date:   Tue Jul 16 10:12:11 2024 +0800

    Add Muirbench (#143)

    * handle gen kwargs in internvl2

    * Add muirbench

* Add files via upload

(cherry picked from commit 557083a156c3dd67ac79e22b4202e9b69b6b00f4)

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: Yan Shu <570533048@qq.com>

* Fix llava onevision loglikelihood video bug

* LiveBench July (#146)

* claude auto detect json mode

* extract information

* use claude to generate

* fix bugs

* fix

* generate data

* chore: Update dataset name and version for live_bench task

* gpt-4-turbo => gpt-4o

* chore: Update dataset capture settings in create_dataset.py

* everything use gpt-4o

* websites

* livebench_july

* Refactor code to simplify data assignment in example.ipynb

* chore: Update dataset name for live_bench task

* Add xcomposer2d5 from fanyi, revise something for better usage (#145)

* internvl2

* fix some bugs

* fix

* lint

* feat: Add XComposer2D5 model to AVAILABLE_MODELS

* xcomposer

* Fix llava vid error when using public

* Fix xcomposer2d5

* Add generation tokens

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>

* Dev/ov evals (#147)

* fix doc

* [WIP] adding mmbench dev evaluation (#75)

* WIP

* Update GPT evaluation model name and sys prompt

* 🛠️ Scale accuracy to percentage

The accuracy value is now multiplied by 100 in the aggregation function to represent it as a percentage. Regarding the evaluation process, `math` module importation and refactoring reduce progress log verbosity by logging every 100 evaluations instead of 10. It prevents potential logging overflow. Handling of NaN values is added to ensure 'default_value' is set in case of missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintenance.

Issue refs: #1427, #1533

* Update GPT evaluation model name and API configuration

* Refactor MMBench_Evaluator class to handle missing columns

* Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations

* Refactor MMBench-CN and MMBench-EN evaluation functions

* 🔄 Refactor result processing and logging logic

- Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
- Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
- In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
- Commented-out code and verbose logging during evaluation, which may have interfered with performance, has been removed for a more efficient and less intrusive logging experience.

This cleanup reduces redundancy in the codebase and improves evaluation performance.

Refs #2045

---------

Co-authored-by: Bo Li <bo.li01@bytedance.com>
(cherry picked from commit a19278c2ea6ddcbca64d3cc7f4efec7fe5775121)

* Create README.md

* Add files via upload

* Add MathVerse

* Fix typo in qwen_vl that was causing "reference before assignment"

* convert contexts to list if necessary and remove unnecessary construction of `questions`

* refactor query construction for clarity

* Create ScreenSpot on clean branch

* Update README to reflect new tasks

* Add README file specific to ScreenSpot

* slight update

* Init webSRC

* Draft README for WebSRC

* Update main README with new task names

* Draft and validate websrc eval on dev split

* Add code to enable compilation of submission for WebSRC test split

* Bugfix: WebSRC should be token-level F1 NOT character-level

* Add qwen vl api

* Fix llava conv template for llama3

* Fix llava_hf generation for 1.6

* Parse result for llava_hf 1.6

* Add model_name parameter to Llava constructor

* Fix endless warning for llava_hf generation

* Fix llava_hf image tokens number issue

* Create LICENSE

* Update LICENSE

* Update LICENSE

* Better task list_with_num

* Fix idefics2 llava in the wild bugs

* Remove redundant code in fuyu

* Fix instructblip qformer size mismatch and multi-images problem

* Comment out parse result in xcomposer

* Comment out Spice in caption task so that don't need to download stanford nlp model

* Update gitignore

* Add separated pope tasks by category

* Fix pope random name in pope full

* Set printing info for llava_hf to debug level

* Adding Phi3v model.

* Adding prompt arguments for Phi3v on MathVista-TestMini

* Adding documentation of Phi3v class.

* [Fix] import issues of multilingual llava and olympiadbench

* fix compatibility issue of older version llava

* add upd

* add upd

* add upd

* add upd

* add upd

* add upd

* Group MMMU images into one image (#83)

* update

* update font

* Add matplotlib.font_manager import in utils.py

* Refactor font handling in add_order_label function in utils.py

* group mmmu

---------

Co-authored-by: Li Bo <drluodian@gmail.com>

* merge model_specific_prompt_kwargs and dataset_name into each task yaml

* Add MathVerse in README.md

* slightly change query_prompt for the reproduction

* update utils.py for leaderboard submission

* add conbench

* update README

* Update README.md

* init include vcr

* modify the form of VCR

* switch logic

* add crossed_text to vcr_wiki output

* include the try-except logic for spacy

* update vcr_wiki tasks

* update vcr_wiki tasks in README.md

* include std and confidence interval

* update gpt-3.5-turbo version

* update gpt-3.5-turbo version

* chore: Remove unnecessary files and code related to live_bench and sft_eval tasks

* Bump version to 0.2.0.dev0

* chore: Update lmms-eval to support video evaluations for LLaVA models

* Update llava conv_template in lmms_eval/models/llava.py

* Update image alignment in README.md

* chore: Update lmms-eval to support video evaluations for LLaVA models

* chore: Update lmms-eval to support video evaluations for LLaVA models

* Update README.md

* Update README.md

* update aggregation function for vcr_wiki

* update README.md

* Update README.md

* update version

* add II-Bench

* fix dataset_path

* Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image

* add tinyllava

* LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

* fix #117, allow auto download with tar format videos

* fix #117, allow auto download with tar format videos

* fix typo

* feat: Add support for auto downloading tar format videos

* Release llava-wilder

* chore: Update dependencies to fix potential risks and improve compatibility

* tutorial

* docs

* update preparation

* small fix

* small fix

* lint

* to sh script

* update readme

* Remove handling non-visual loop in llava

* Add llava_hf back to registry

* Update README.md

* Update README.md

* update ablation for videomme datasets

* chore: Handle ImportError when importing models

Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.

* chore: Remove unused models from lmms_eval package

* feat: Allow loading model configurations from other packages

* feat: Allow including external tasks from plugins

* chore: Add loguru for logging in lmms_eval package

* Remove unnecessary lines since use batched visuals now in llava

* Add longva

* Revise model registry for llava_hf and longva

* Delete unnecessary lines

* Remove unnecessary lines for video llava

* Update pyproject.toml

* Update activitynetqa_generation.yaml

* Fix vid mme post prompt issue

* new task gqa-ru

* add mmbench_ru_dev

* change prompt to ru

* create new task vitatecs

* Update README.md

* Add wild vision 0617

* Hardcode to keep image for wild vision

* Fixing scoring logic

* Fixing dataset name

* Fixing handling None filtered score

* Add detailcaps

* Add install capture_metric in env

* Add files via upload

* feat: Add tie_weights parameter to Llava model initialization

* Upgrade lmms-eval to support more models and evaluation tasks

* Upgrade lmms-eval to version 0.2.1

* Rename xcomposer 4KHD

* chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

* Update utils.py

* Update _default_template_vcr_yaml

* add process sync via temp file in lmms_eval/evaluator.py

* Update utils.py

* Update _default_template_vcr_yaml

* Add muirbench

* Squashed commit of the following:

commit dfdba507b5fbe985b0030ffec575f9f2638bc1ed
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 16 11:13:52 2024 +0800

    merge ov evals (#144)

    * chore: Update gpt_eval_model_name to "gpt-3.5-turbo" in mathvista.yaml

    * Squashed commit of the following:

    commit 994c9f97a2f8db3e9b7d7933d1e1680acde5b70b
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 17:21:23 2024 +0800

        Add files via upload

    * Squashed commit of the following:

    commit e31cd7883d4555c7530795c7f102b8d78cbd372f
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jul 10 12:08:08 2024 +1000

        chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

    commit 1d8c980d1089f9d7702c3b92d5c85039f2809c6d
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jul 9 02:08:52 2024 +0000

        Rename xcomposer 4KHD

    commit 6da76f36ecf5f9aa73057e767a4fcb60c99ff896
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:55:56 2024 +1000

        Upgrade lmms-eval to version 0.2.1

    commit cd1858523fcd8630082cbefba8710e0de3ee8805
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:52:23 2024 +1000

        Upgrade lmms-eval to support more models and evaluation tasks

    commit 672d7e5bb49dcb34e1b2fdeb09f3f4588dc583a6
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:43:41 2024 +1000

        feat: Add tie_weights parameter to Llava model initialization

    commit 2037a86261b55fa42b8ba3a04eab192b3e69d6ea
    Merge: e6844db1 a5c18692
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:37:12 2024 +1000

        Fix gen kwargs image aspect ratio in internvl2

    commit a5c186925de989b616f58a35ece36065a32b4594
    Merge: 2ebec77f 557083a1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 9 09:15:56 2024 +0800

        Merge pull request #137 from shuyansy/main

        add MLVU task

    commit 557083a156c3dd67ac79e22b4202e9b69b6b00f4
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 16:56:50 2024 +0800

        Add files via upload

    commit 2ebec77f5606d79e9a7b995970e32792050606a1
    Merge: 211bfede b23d349e
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 8 11:53:06 2024 +0800

        Merge pull request #136 from Dousia/main

        Add detailcaps

    commit b23d349e46d60dc149ffaa54d6e019f4996ed92d
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:24:19 2024 +0800

        Add install capture_metric in env

    commit c6e211d5f9dbb7572d3a141b6504cb1ca2007c33
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:04:13 2024 +0800

        Add detailcaps

    commit 211bfedebad243ef82a8b0be36c3b5a9b9cb2f72
    Merge: 7c208b76 79514eee
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 2 23:05:12 2024 +0800

        Merge pull request #133 from EvolvingLMMs-Lab/dev/wild_vision

        Add wild vision bench

    commit 79514eeebcfd6f655be2a10c776037d12a7b7214
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 15:10:02 2024 +0000

        Fixing handling None filtered score

    commit 725fac2781446958b905e1e6c6eb3c0a8e582e49
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:25:42 2024 +0000

        Fixing dataset name

    commit 8d963e132ac03fc0d835d480cfcfcabe72af143c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:24:51 2024 +0000

        Fixing scoring logic

    commit e2990d0a69e876721256fdf946c68ba7ae0cbdc1
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:57 2024 +0000

        Hardcode to keep image for wild vision

    commit ed381736730d8fb785b4ee919fdb751734ecef25
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:38 2024 +0000

        Add wild vision 0617

    commit 7c208b76640c986cfe94233dce735c3ca4ad4319
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:53:31 2024 +0800

        Update README.md

    commit 39d40dea47bc59ff04e8b0cbc445345098debc9a
    Merge: e19b43a3 ba7081c0
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:47:09 2024 +0800

        Merge pull request #129 from Dannoopsy/mmbench_ru

        add task MMBench-ru

    commit e19b43a3a1e7212e623061b164b0419cc0dda689
    Merge: 11fd7e3f a0de8970
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:58 2024 +0800

        Merge pull request #128 from Dannoopsy/gqa-ru

        add task gqa-ru

    commit 11fd7e3fc05908aeb01e4a6161a7b55cd38b3122
    Merge: 383e7fea a7522592
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:16 2024 +0800

        Merge pull request #130 from lscpku/vitatecs

        Add task VITATECS

    commit a75225926e5954f85466d257f99acf0163fde596
    Author: lscpku <lisc99@pku.edu.cn>
    Date:   Fri Jun 28 20:37:06 2024 +0800

        create new task vitatecs

    commit ba7081c0abac840002d320e30733e891298dfa11
    Author: Dannoopsy <63581325+Dannoopsy@users.noreply.github.com>
    Date:   Fri Jun 28 12:21:05 2024 +0300

        change prompt to ru

    commit 27ea9c0055a8abf3a8198829b8617018479918e2
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Thu Jun 27 17:17:29 2024 +0000

        add mmbench_ru_dev

    commit 383e7fead3138aedf62e9c0ec48303835ef26e2a
    Merge: 06fa000f ed2e7f79
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri Jun 28 00:14:10 2024 +0800

        Merge pull request #126 from lorenzomammana/feature/external-package-integration

        External package integration using plugins

    commit ed2e7f792151d21bce8f1c498270b9391e1d5c85
    Merge: 03947e14 06fa000f
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Thu Jun 27 15:38:10 2024 +0000

        Merge branch 'main' into feature/external-package-integration

    commit a0de89708d5e6f259bb17f0eaace3c5b901b275c
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Tue Jun 25 11:11:37 2024 +0000

        new task gqa-ru

    commit 06fa000f60d3e4d160fac8ceb9959ae92a98f752
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jun 25 06:41:13 2024 +0000

        Fix vid mme post prompt issue

    commit b388d79e0df6f60068196cb7047453ebd22d6ef1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 22:31:16 2024 +0800

        Update activitynetqa_generation.yaml

    commit 8f9d620fcd9d0a0742ee6bcf51ea63bd6b088a36
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:25 2024 +0800

        Update pyproject.toml

    commit 6341b7c15ce9fb28eb06b067ddb299d6cf2e16c3
    Merge: fce85f1b 903b042b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:02 2024 +0800

        Merge pull request #125 from EvolvingLMMs-Lab/dev/interleave

        [Model] aligned llava-interleave model results on video tasks

    commit 903b042be016016d4ebeecb07701f3076a2d323c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 12:07:13 2024 +0000

        Remove unnecessary lines for video llava

    commit d78ec86407b729a964906a8c2e50704b4bc74d06
    Merge: ebe7217a fce85f1b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Jun 22 13:57:31 2024 +0800

        Merge branch 'main' into dev/interleave

    commit ebe7217a486c1e754e42c2cbdb834e09fbbcc9b0
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 02:57:08 2024 +0000

        Delete unnecessary lines

    commit 120c474b056f9177c74e1fd9691d59e2f234b785
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:38:41 2024 +0000

        Revise model registry for llava_hf and longva

    commit 7d6201f921088afd3f52a35076e3c6fcc9aa518c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:38:24 2024 +0000

        Add longva

    commit 12f480699c71a12a24d4349d9b0681933201a3a6
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:35:39 2024 +0000

        Remove unnecessary lines since use batched visuals now in llava

    commit 12cea76f1f0f14b1fd1007c9d39a9b0557368637
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 18:15:32 2024 +0000

        chore: Add loguru for logging in lmms_eval package

    commit 03947e14a46fd25b412931f7c9c25f4a2971d0b4
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Wed Jun 5 13:40:41 2024 +0000

        feat: Allow including external tasks from plugins

    commit b80a91f73e15ddd0b0ce1322d7d121fa14030eed
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Wed Jun 5 13:04:55 2024 +0000

        feat: Allow loading model configurations from other packages

    commit 8ef24740dd48a11c97eb627f2fff4aca107fef0d
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 12:11:03 2024 +0000

        chore: Remove unused models from lmms_eval package

    commit af38885fc2e066f5ea44388f33e07176f836fe28
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 12:07:09 2024 +0000

        chore: Handle ImportError when importing models

        Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.

    commit fce85f1b03ff7043b29dee787c5d17a08dd2687a
    Merge: dbe63293 d94f83cb
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 20:02:12 2024 +0800

        Merge pull request #120 from EvolvingLMMs-Lab/pufanyi/hf_dataset_docs

        Add docs for datasets upload to HF

    commit dbe63293245a5141fdfd80bda7657c304f6bd32f
    Author: choiszt <ls2001927@sohu.com>
    Date:   Thu Jun 20 15:14:21 2024 +0800

        update ablation for videomme datasets

    commit d94f83cb3f08b61a2c75cc4326e58792100605b3
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 13:30:59 2024 +0800

        Update README.md

    commit cab8159ff35db330536c0b6dfb4b0a3b24142209
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 13:30:29 2024 +0800

        Update README.md

    commit 45876652a877a8006b828f32f5cc4660629f9190
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu Jun 20 03:55:30 2024 +0000

        Add llava_hf back to registry

    commit 3463651b8c54d36cd94169e3d376f5ed225a195a
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu Jun 20 03:54:33 2024 +0000

        Remove handling non-visual loop in llava

    commit cb0d3f49b72790b081f981e0e6147131542f7f68
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Thu Jun 20 02:11:18 2024 +0800

        update readme

    commit 813877bfe5ac590cdbe92dd74d18f83a2091f748
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:37:52 2024 +0800

        to sh script

    commit a14684b8557d5894976448a5c559ed7a66a6cf16
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:37:04 2024 +0800

        lint

    commit d0f8851d42ba31f5da2a7a65e91499db45174dbc
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:36:48 2024 +0800

        small fix

    commit 63748e9718f287ad433afc90e340b5e17a89c1ed
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:36:43 2024 +0800

        small fix

    commit 7f1159a1fe04cfb783dc31d4fbdef3bda0ce19e4
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:35:05 2024 +0800

        update preparation

    commit 19f9bd621c76a483ff98f8c7eb78f64753da683a
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:23:24 2024 +0800

        docs

    commit ce6f889ba02d819979c7922f6336cf4f1f718f65
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:04:16 2024 +0800

        tutorial

    commit f513c520c2a3dad26d2b2ca5c4ed4db05a493c73
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 19 06:51:19 2024 +0000

        chore: Update dependencies to fix potential risks and improve compatibility

    commit efb529552c5e4ba039a4cba8e9aa5cb7ba65bf90
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Wed Jun 19 10:25:58 2024 +0800

        Release llava-wilder

    commit 742651fc9daf97e2f57831ed6e6e7ee7ead7d555
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 07:44:26 2024 +0800

        feat: Add support for auto downloading tar format videos

    commit 511b6259828212fcba954cdeb8cf90d6e5daabf8
    Merge: 22a4958e 050b2c37
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jun 18 17:01:03 2024 +0000

        Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

    commit 050b2c370017e9b97475dd6cf01fd051b5ca5c86
    Merge: 74facb41 ef306512
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jun 18 13:13:38 2024 +0800

        Merge pull request #114 from zjysteven/add-tinyllava

        add tinyllava

    commit ef306512e5135f76dffa383f600b8733015836e8
    Author: Jingyang Zhang <jingyang.zhang@duke.edu>
    Date:   Mon Jun 17 17:57:02 2024 -0400

        fix typo

    commit 9bab67732a4238097725deddf867fb1946ffee40
    Merge: dbfb2387 74facb41
    Author: Jingyang Zhang <jingyang.zhang@duke.edu>
    Date:   Sun Jun 16 10:56:05 2024 -0400

        Merge branch 'EvolvingLMMs-Lab:main' into add-tinyllava

    commit 74facb41a826691dfce4458cf1d8659b34fc5bf5
    Merge: 8ba192f9 d5df72de
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 16 17:59:19 2024 +0800

        Merge pull request #118 from teowu/main

        Fix the potential risk by PR #117

    commit d5df72de2d03108d6b365818ecc3551ac9aa6302
    Merge: 5bf59ed2 8ba192f9
    Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
    Date:   Sun Jun 16 15:32:13 2024 +0800

        Merge branch 'EvolvingLMMs-Lab:main' into main

    commit 5bf59ed250da98a408a94e214a73caa400cba842
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Sun Jun 16 07:27:28 2024 +0000

        fix #117, allow auto download with tar format videos

    commit 98b3955cb808e…
kcz358 pushed a commit that referenced this pull request Sep 5, 2024
* chore: Update gpt_eval_model_name to "gpt-3.5-turbo" in mathvista.yaml

* Squashed commit of the following:

commit 994c9f97a2f8db3e9b7d7933d1e1680acde5b70b
Author: Yan Shu <570533048@qq.com>
Date:   Mon Jul 8 17:21:23 2024 +0800

    Add files via upload

* Squashed commit of the following:

commit e31cd78
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jul 10 12:08:08 2024 +1000

    chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

commit 1d8c980
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue Jul 9 02:08:52 2024 +0000

    Rename xcomposer 4KHD

commit 6da76f3
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:55:56 2024 +1000

    Upgrade lmms-eval to version 0.2.1

commit cd18585
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:52:23 2024 +1000

    Upgrade lmms-eval to support more models and evaluation tasks

commit 672d7e5
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:43:41 2024 +1000

    feat: Add tie_weights parameter to Llava model initialization

commit 2037a86
Merge: e6844db a5c1869
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jul 9 11:37:12 2024 +1000

    Fix gen kwargs image aspect ratio in internvl2

commit a5c1869
Merge: 2ebec77 557083a
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 9 09:15:56 2024 +0800

    Merge pull request #137 from shuyansy/main

    add MLVU task

commit 557083a
Author: Yan Shu <570533048@qq.com>
Date:   Mon Jul 8 16:56:50 2024 +0800

    Add files via upload

commit 2ebec77
Merge: 211bfed b23d349
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 8 11:53:06 2024 +0800

    Merge pull request #136 from Dousia/main

    Add detailcaps

commit b23d349
Author: ByteDance <bytedance@MacBook-Pro.local>
Date:   Sun Jul 7 23:24:19 2024 +0800

    Add install capture_metric in env

commit c6e211d
Author: ByteDance <bytedance@MacBook-Pro.local>
Date:   Sun Jul 7 23:04:13 2024 +0800

    Add detailcaps

commit 211bfed
Merge: 7c208b7 79514ee
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 2 23:05:12 2024 +0800

    Merge pull request #133 from EvolvingLMMs-Lab/dev/wild_vision

    Add wild vision bench

commit 79514ee
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 15:10:02 2024 +0000

    Fixing handling None filtered score

commit 725fac2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 08:25:42 2024 +0000

    Fixing dataset name

commit 8d963e1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 08:24:51 2024 +0000

    Fixing scoring logic

commit e2990d0
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 06:06:57 2024 +0000

    Hardcode to keep image for wild vision

commit ed38173
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon Jul 1 06:06:38 2024 +0000

    Add wild vision 0617

commit 7c208b7
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:53:31 2024 +0800

    Update README.md

commit 39d40de
Merge: e19b43a ba7081c
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:47:09 2024 +0800

    Merge pull request #129 from Dannoopsy/mmbench_ru

    add task MMBench-ru

commit e19b43a
Merge: 11fd7e3 a0de897
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:46:58 2024 +0800

    Merge pull request #128 from Dannoopsy/gqa-ru

    add task gqa-ru

commit 11fd7e3
Merge: 383e7fe a752259
Author: Li Bo <drluodian@gmail.com>
Date:   Mon Jul 1 11:46:16 2024 +0800

    Merge pull request #130 from lscpku/vitatecs

    Add task VITATECS

commit a752259
Author: lscpku <lisc99@pku.edu.cn>
Date:   Fri Jun 28 20:37:06 2024 +0800

    create new task vitatecs

commit ba7081c
Author: Dannoopsy <63581325+Dannoopsy@users.noreply.github.com>
Date:   Fri Jun 28 12:21:05 2024 +0300

    change prompt to ru

commit 27ea9c0
Author: Dannoopsy <belopolskikh.dd@phystech.edu>
Date:   Thu Jun 27 17:17:29 2024 +0000

    add mmbench_ru_dev

commit 383e7fe
Merge: 06fa000 ed2e7f7
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Jun 28 00:14:10 2024 +0800

    Merge pull request #126 from lorenzomammana/feature/external-package-integration

    External package integration using plugins

commit ed2e7f7
Merge: 03947e1 06fa000
Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
Date:   Thu Jun 27 15:38:10 2024 +0000

    Merge branch 'main' into feature/external-package-integration

commit a0de897
Author: Dannoopsy <belopolskikh.dd@phystech.edu>
Date:   Tue Jun 25 11:11:37 2024 +0000

    new task gqa-ru

commit 06fa000
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue Jun 25 06:41:13 2024 +0000

    Fix vid mme post prompt issue

commit b388d79
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 22:31:16 2024 +0800

    Update activitynetqa_generation.yaml

commit 8f9d620
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 14:02:25 2024 +0800

    Update pyproject.toml

commit 6341b7c
Merge: fce85f1 903b042
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 23 14:02:02 2024 +0800

    Merge pull request #125 from EvolvingLMMs-Lab/dev/interleave

    [Model] aligned llava-interleave model results on video tasks

commit 903b042
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Jun 22 12:07:13 2024 +0000

    Remove unnecessary lines for video llava

commit d78ec86
Merge: ebe7217 fce85f1
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 22 13:57:31 2024 +0800

    Merge branch 'main' into dev/interleave

commit ebe7217
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Jun 22 02:57:08 2024 +0000

    Delete unnecessary lines

commit 120c474
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:38:41 2024 +0000

    Revise model registry for llava_hf and longva

commit 7d6201f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:38:24 2024 +0000

    Add longva

commit 12f4806
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Jun 21 08:35:39 2024 +0000

    Remove unnecessary lines since use batched visuals now in llava

commit 12cea76
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 18:15:32 2024 +0000

    chore: Add loguru for logging in lmms_eval package

commit 03947e1
Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
Date:   Wed Jun 5 13:40:41 2024 +0000

    feat: Allow including external tasks from plugins

commit b80a91f
Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
Date:   Wed Jun 5 13:04:55 2024 +0000

    feat: Allow loading model configurations from other packages

commit 8ef2474
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 12:11:03 2024 +0000

    chore: Remove unused models from lmms_eval package

commit af38885
Author: Bo Li <drluodian@gmail.com>
Date:   Thu Jun 20 12:07:09 2024 +0000

    chore: Handle ImportError when importing models

    Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.

commit fce85f1
Merge: dbe6329 d94f83c
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 20:02:12 2024 +0800

    Merge pull request #120 from EvolvingLMMs-Lab/pufanyi/hf_dataset_docs

    Add docs for datasets upload to HF

commit dbe6329
Author: choiszt <ls2001927@sohu.com>
Date:   Thu Jun 20 15:14:21 2024 +0800

    update ablation for videomme datasets

commit d94f83c
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 13:30:59 2024 +0800

    Update README.md

commit cab8159
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 20 13:30:29 2024 +0800

    Update README.md

commit 4587665
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu Jun 20 03:55:30 2024 +0000

    Add llava_hf back to registry

commit 3463651
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu Jun 20 03:54:33 2024 +0000

    Remove handling non-visual loop in llava

commit cb0d3f4
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jun 20 02:11:18 2024 +0800

    update readme

commit 813877b
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:37:52 2024 +0800

    to sh script

commit a14684b
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:37:04 2024 +0800

    lint

commit d0f8851
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:36:48 2024 +0800

    small fix

commit 63748e9
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:36:43 2024 +0800

    small fix

commit 7f1159a
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:35:05 2024 +0800

    update preparation

commit 19f9bd6
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:23:24 2024 +0800

    docs

commit ce6f889
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 15:04:16 2024 +0800

    tutorial

commit f513c52
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 19 06:51:19 2024 +0000

    chore: Update dependencies to fix potential risks and improve compatibility

commit efb5295
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Wed Jun 19 10:25:58 2024 +0800

    Release llava-wilder

commit 742651f
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Wed Jun 19 07:44:26 2024 +0800

    feat: Add support for auto downloading tar format videos

commit 511b625
Merge: 22a4958 050b2c3
Author: Bo Li <drluodian@gmail.com>
Date:   Tue Jun 18 17:01:03 2024 +0000

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

commit 050b2c3
Merge: 74facb4 ef30651
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 18 13:13:38 2024 +0800

    Merge pull request #114 from zjysteven/add-tinyllava

    add tinyllava

commit ef30651
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Mon Jun 17 17:57:02 2024 -0400

    fix typo

commit 9bab677
Merge: dbfb238 74facb4
Author: Jingyang Zhang <jingyang.zhang@duke.edu>
Date:   Sun Jun 16 10:56:05 2024 -0400

    Merge branch 'EvolvingLMMs-Lab:main' into add-tinyllava

commit 74facb4
Merge: 8ba192f d5df72d
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jun 16 17:59:19 2024 +0800

    Merge pull request #118 from teowu/main

    Fix the potential risk by PR #117

commit d5df72d
Merge: 5bf59ed 8ba192f
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sun Jun 16 15:32:13 2024 +0800

    Merge branch 'EvolvingLMMs-Lab:main' into main

commit 5bf59ed
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:27:28 2024 +0000

    fix #117, allow auto download with tar format videos

commit 98b3955
Merge: a056f11 be9dada
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:25:07 2024 +0000

    Merge branch 'main' of https://github.com/teowu/lmms-eval into main

commit a056f11
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sun Jun 16 07:23:54 2024 +0000

    fix #117, allow auto download with tar format videos

commit 8ba192f
Merge: 7cc2890 be9dada
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 17:30:59 2024 +0800

    Merge pull request #117 from teowu/main

    LongVideoBench for LMMs-Eval

commit be9dada
Merge: 62ea8ce 7cc2890
Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
Date:   Sat Jun 15 16:39:20 2024 +0800

    Merge pull request #1 from EvolvingLMMs-Lab/main

    Merge pull request #113 from teowu/main

commit 62ea8ce
Author: teowu <realtimothyhwu@gmail.com>
Date:   Sat Jun 15 08:30:11 2024 +0000

    LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

commit 7cc2890
Merge: 4bc7224 ea14cd4
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Jun 15 14:10:22 2024 +0800

    Merge pull request #113 from teowu/main

    Q-Bench, Q-Bench2, A-Bench

commit dbfb238
Author: Jingyang <jingyang.zhang@duke.edu>
Date:   Fri Jun 14 16:20:42 2024 -0400

    add tinyllava

commit ea14cd4
Author: teowu <realtimothyhwu@gmail.com>
Date:   Fri Jun 14 15:01:52 2024 +0000

    Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image

commit 4bc7224
Merge: 2797987 bf14cb8
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Jun 14 02:14:43 2024 +0800

    Merge pull request #111 from XinrunDu/main

    add II-Bench

commit bf14cb8
Author: XinrunDu <duxinrun2000@gmail.com>
Date:   Thu Jun 13 09:37:02 2024 +0000

    fix dataset_path

commit 6248113
Author: XinrunDu <duxinrun2000@gmail.com>
Date:   Thu Jun 13 09:32:06 2024 +0000

    add II-Bench

commit 2797987
Merge: 63d82f1 66d4bb2
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 11:14:47 2024 +0800

    Merge pull request #109 from EvolvingLMMs-Lab/pufanyi/update_version

    [Small Update] Update the version of LMMs-Eval

commit 66d4bb2
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jun 13 11:13:00 2024 +0800

    update version

commit 63d82f1
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 11:04:32 2024 +0800

    Update README.md

commit 44a3379
Merge: 5ed0035 0ce46d0
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 04:00:12 2024 +0800

    Merge pull request #105 from tianyu-z/main

    Include VCR

commit 0ce46d0
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:56:34 2024 -0400

    update README.md

commit 46a88d8
Merge: 47b13b9 5ed0035
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:50:26 2024 -0400

    merged readme.md

commit 47b13b9
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Wed Jun 12 15:30:52 2024 -0400

    update aggregation function for vcr_wiki

commit 5ed0035
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:21:42 2024 +0800

    Update README.md

commit ed88068
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:13:59 2024 +0800

    Update README.md

commit fea3806
Merge: d99a24a 05dc8e8
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 13 03:11:49 2024 +0800

    Merge pull request #108 from EvolvingLMMs-Lab/internal_main_dev

    [Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval

commit 05dc8e8
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:56:04 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit cbeee20
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:50:30 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit f00d549
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:46:33 2024 +0000

    Update image alignment in README.md

commit 3415633
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:43:16 2024 +0000

    Update llava conv_template in lmms_eval/models/llava.py

commit 50575a9
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:39:03 2024 +0000

    chore: Update lmms-eval to support video evaluations for LLaVA models

commit c9b2252
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:33:48 2024 +0000

    Bump version to 0.2.0.dev0

commit 465bd42
Merge: e43bd84 d99a24a
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 15:04:25 2024 +0000

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval into internal_main_dev

commit e43bd84
Author: Bo Li <drluodian@gmail.com>
Date:   Wed Jun 12 14:54:06 2024 +0000

    chore: Remove unnecessary files and code related to live_bench and sft_eval tasks

commit d99a24a
Merge: 374590b a66003b
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jun 12 19:45:57 2024 +0800

    Merge pull request #107 from AtsuMiyai/new_task/upd_update

    update gpt-3.5-turbo version

commit a66003b
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed Jun 12 17:05:17 2024 +0900

    update gpt-3.5-turbo version

commit ee91f27
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed Jun 12 16:50:53 2024 +0900

    update gpt-3.5-turbo version

commit 326b969
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 20:07:40 2024 -0400

    include std and confidence interval

commit cd050d4
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 18:49:47 2024 -0400

    update vcr_wiki tasks in README.md

commit 205721e
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 18:43:15 2024 -0400

    update vcr_wiki tasks

commit db8e718
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 16:13:58 2024 -0400

    include the try-except logic for spacy

commit 427dabb
Author: Suyuchen <suyuchen.wang@umontreal.ca>
Date:   Mon Jun 10 15:51:05 2024 -0400

    add crossed_text to vcr_wiki output

commit 043b483
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 15:47:00 2024 -0400

    switch logic

commit e1f04db
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 02:38:21 2024 -0400

    modify the form of VCR

commit 96e8d98
Author: tianyu-z <zhangtianyupro@gmail.com>
Date:   Mon Jun 10 00:10:30 2024 -0400

    init include vcr

commit 374590b
Merge: 504685e cb3b9ce
Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
Date:   Fri Jun 7 20:25:48 2024 +0800

    Merge pull request #101 from Gumpest/main

    Update conbench in README

commit 504685e
Author: Li Bo <drluodian@gmail.com>
Date:   Thu Jun 6 15:42:15 2024 +0800

    Update README.md

commit cb3b9ce
Merge: c9793b3 67b64ea
Author: Yuan Zhang <56063339+Gumpest@users.noreply.github.com>
Date:   Thu Jun 6 11:22:24 2024 +0800

    Merge branch 'EvolvingLMMs-Lab:main' into main

commit c9793b3
Author: Yuan Zhang <gump_well_done@163.com>
Date:   Thu Jun 6 11:21:05 2024 +0800

    update README

commit 67b64ea
Merge: 8ee7848 5fd6845
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jun 5 23:12:58 2024 +0800

    Merge pull request #100 from Gumpest/main

    add Conbench

commit 5fd6845
Author: Yuan Zhang <gump_well_done@163.com>
Date:   Wed Jun 5 21:52:31 2024 +0800

    add conbench

commit 8ee7848
Merge: 747e197 6fefaf7
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 4 17:09:33 2024 +0800

    Merge pull request #95 from AtsuMiyai/new_task/upd

    add MM-UPD

commit 747e197
Merge: 4854a34 0584307
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jun 4 17:09:04 2024 +0800

    Merge pull request #97 from CaraJ7/update

    Add MathVerse in README.md

commit 6fefaf7
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Tue Jun 4 17:36:39 2024 +0900

    update utils.py for leaderboard submission

commit 5f4fe36
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Sun Jun 2 23:28:27 2024 +0900

    slightly change query_prompt for the reproduction

commit 0584307
Author: CaraJ7 <1350074492@qq.com>
Date:   Sun Jun 2 17:05:28 2024 +0800

    Add MathVerse in README.md

commit 0581ab3
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Fri May 31 16:09:45 2024 +0900

    merge model_specific_prompt_kwargs and dataset_name into each task yaml

commit 4854a34
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Sat May 4 19:23:39 2024 +0800

    Group MMMU images into one image (#83)

    * update

    * update font

    * Add matplotlib.font_manager import in utils.py

    * Refactor font handling in add_order_label function in utils.py

    * group mmmu

    ---------

    Co-authored-by: Li Bo <drluodian@gmail.com>

commit d224794
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 15:15:59 2024 +0900

    add upd

commit 453e793
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 15:03:30 2024 +0900

    add upd

commit 909edd6
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:52:21 2024 +0900

    add upd

commit 7c1ac97
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:50:32 2024 +0900

    add upd

commit 811301c
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:46:58 2024 +0900

    add upd

commit 71401ba
Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
Date:   Wed May 29 12:41:21 2024 +0900

    add upd

commit 24dc435
Author: Bo Li <drluodian@gmail.com>
Date:   Mon May 27 10:17:32 2024 +0000

    fix compatibility issue of older version llava

commit 616edf4
Author: Bo Li <drluodian@gmail.com>
Date:   Mon May 27 09:32:26 2024 +0000

    [Fix] import issues of multilingual llava and olympiadbench

commit 4c5a99e
Merge: 45c05b2 b05c3e2
Author: Li Bo <drluodian@gmail.com>
Date:   Mon May 27 14:19:53 2024 +0800

    Merge pull request #87 from vfragoso/vifragos/phi3v

    Adding microsoft/Phi-3-vision-128k-instruct model.

commit b05c3e2
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 16:36:37 2024 +0000

    Adding documentation of Phi3v class.

commit c200897
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 16:25:02 2024 +0000

    Adding prompt arguments for Phi3v on MathVista-TestMini

commit 7f9fb6b
Author: Victor Fragoso <victor.fragoso@microsoft.com>
Date:   Fri May 24 13:24:16 2024 +0000

    Adding Phi3v model.

commit 45c05b2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:47:36 2024 +0000

    Set printing info for llava_hf to debug level

commit 53f013e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:41:39 2024 +0000

    Fix pope random name in pope full

commit 22520a9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 23 03:41:14 2024 +0000

    Add separated pope tasks by category

commit d1eefb1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 9 08:36:02 2024 +0000

    Update gitignore

commit b2b4dbd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 20 07:45:11 2024 +0000

    Comment out Spice in caption task so that don't need to download stanford nlp model

commit 662f05c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 20 03:13:13 2024 +0000

    Comment out parse result in xcomposer

commit 0932932
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 03:55:39 2024 +0000

    Fix instructblip qformer size mismatch and multi-images problem

commit 557a6a3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 03:11:41 2024 +0000

    Remove redundant code in fuyu

commit 6aeb550
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 16 01:45:24 2024 +0000

    Fix idefics2 llava in the wild bugs

commit aea80e6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Wed May 15 11:07:35 2024 +0000

    Better task list_with_num

commit 3c12a08
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:35:52 2024 +0800

    Update LICENSE

commit 82317a6
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:29:09 2024 +0800

    Update LICENSE

commit a8bba1c
Author: Li Bo <drluodian@gmail.com>
Date:   Sat May 18 02:28:03 2024 +0800

    Create LICENSE

commit caa5893
Merge: c094448 423b006
Author: Li Bo <drluodian@gmail.com>
Date:   Mon May 13 11:45:26 2024 +0800

    Merge pull request #73 from EvolvingLMMs-Lab/kc/qwen_vl_api

    [Feat] Add qwen vl api

commit c094448
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat May 11 06:11:19 2024 +0000

    Fix llava_hf image tokens number issue

commit 64f07e4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Thu May 9 02:04:10 2024 +0000

    Fix endless warning for llava_hf generation

commit 8aaa828
Author: Bo Li <drluodian@gmail.com>
Date:   Thu May 2 06:13:56 2024 +0000

    Add model_name parameter to Llava constructor

commit 7847dc4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue May 7 03:15:59 2024 +0000

    Parse result for llava_hf 1.6

commit 3e56b4f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Tue May 7 03:09:56 2024 +0000

    Fix llava_hf generation for 1.6

commit fa3ff92
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Mon May 6 08:32:57 2024 +0000

    Fix llava conv template for llama3

commit 423b006
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun May 5 07:54:52 2024 +0000

    Add qwen vl api

commit b7fd7a9
Merge: 986139a c5a130b
Author: Li Bo <drluodian@gmail.com>
Date:   Sun May 5 13:19:48 2024 +0800

    Merge pull request #59 from EvolvingLMMs-Lab/add_idefics2

    add idefics2

commit 986139a
Merge: b46239c 8d3526c
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:18:18 2024 +0800

    Merge pull request #36 from cocoshe/main

    [Fix] repr llava doc

commit b46239c
Merge: bc69a74 373265f
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:17:34 2024 +0800

    Merge pull request #56 from gagan3012/main

    Multilingual LLava bench

commit bc69a74
Merge: eef3aeb 626e8a9
Author: Li Bo <drluodian@gmail.com>
Date:   Fri May 3 01:12:14 2024 +0800

    Merge pull request #70 from hunterheiden/hsh/new_task/WebSRC

    Bugfix: WebSRC should be token-level F1 NOT character-level

commit 626e8a9
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu May 2 09:31:03 2024 -0400

    Bugfix: WebSRC should be token-level F1 NOT character-level

commit eef3aeb
Merge: c4e9dd9 9bca441
Author: Li Bo <drluodian@gmail.com>
Date:   Thu May 2 14:38:17 2024 +0800

    Merge pull request #69 from hunterheiden/hsh/new_task/WebSRC

    [New Task] WebSRC (multimodal Q&A on web screenshots)

commit 9bca441
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 11:07:29 2024 -0400

    Add code to enable compilation of submission for WebSRC test split

commit 7687495
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:47:32 2024 -0400

    Draft and validate websrc eval on dev split

commit 4eebd3e
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:46:54 2024 -0400

    Update main README with new task names

commit 35fe80b
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed May 1 10:46:20 2024 -0400

    Draft README for WebSRC

commit 955bd06
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Tue Apr 30 10:16:21 2024 -0400

    Init webSRC

commit c4e9dd9
Merge: d8a3a99 319afcc
Author: Li Bo <drluodian@gmail.com>
Date:   Fri Apr 26 14:37:22 2024 +0800

    Merge pull request #63 from hunterheiden/hsh/new_task/screenspot

    New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens

commit 319afcc
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu Apr 25 11:44:34 2024 -0400

    slight update

commit 2f3811c
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Thu Apr 25 11:41:04 2024 -0400

    Add README file specific to ScreenSpot

commit 28962cb
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Wed Apr 24 11:52:33 2024 -0400

    Update README to reflect new tasks

commit e457cfb
Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
Date:   Tue Apr 23 18:33:16 2024 -0400

    Create ScreenSpot on clean branch

commit d8a3a99
Merge: 3dcd015 ed17129
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Apr 23 10:34:03 2024 +0800

    Merge pull request #61 from tupini07/patch-1

    Fix typo in Qwen-VL that was causing "reference before assignment"

commit ed17129
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:56:41 2024 -0600

    refactor query construction for clarity

commit cd87420
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:54:29 2024 -0600

    convert contexts to list if necessary and remove unnecessary construction of `questions`

commit 8557367
Author: Andrea Tupini <tupini07@gmail.com>
Date:   Mon Apr 22 14:47:33 2024 -0600

    Fix typo in qwen_vl that was causing "reference before assignment"

commit 3dcd015
Merge: 95df9fe 743673a
Author: Li Bo <drluodian@gmail.com>
Date:   Sat Apr 20 22:03:16 2024 +0800

    Merge pull request #60 from CaraJ7/main

    Add MathVerse

commit 743673a
Merge: c1a5472 95df9fe
Author: CaraJ7 <1350074492@qq.com>
Date:   Sat Apr 20 21:49:02 2024 +0800

    Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

commit c1a5472
Author: CaraJ7 <1350074492@qq.com>
Date:   Sat Apr 20 21:45:34 2024 +0800

    Add MathVerse

commit 373265f
Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
Date:   Fri Apr 12 17:21:39 2024 -0700

    Add files via upload

commit d853051
Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
Date:   Fri Apr 12 17:19:49 2024 -0700

    Create README.md

commit 22a4958
Author: Bo Li <bo.li01@bytedance.com>
Date:   Thu Apr 4 17:12:43 2024 +0000

    [WIP] adding mmbench dev evaluation (#75)

    * WIP

    * Update GPT evaluation model name and sys prompt

    * 🛠️ Scale accuracy to percentage

    The accuracy value is now multiplied by 100 in the aggregation function to represent it as a percentage. Regarding the evaluation process, `math` module importation and refactoring reduce progress log verbosity by logging every 100 evaluations instead of 10. It prevents potential logging overflow. Handling of NaN values is added to ensure 'default_value' is set in case of missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintenance.

    Issue refs: #1427, #1533

    * Update GPT evaluation model name and API configuration

    * Refactor MMBench_Evaluator class to handle missing columns

    * Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations

    * Refactor MMBench-CN and MMBench-EN evaluation functions

    * 🔄 Refactor result processing and logging logic

    - Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
    - Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
    - In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
    - Commented-out code and verbose logging during evaluation, which may have interfered with performance, has been removed for a more efficient and less intrusive logging experience.

    This cleanup reduces redundancy in the codebase and improves evaluation performance.

    Refs #2045

    ---------

    Co-authored-by: Bo Li <bo.li01@bytedance.com>
    (cherry picked from commit a19278c)

commit 8d3526c
Author: cocoshe <1228759711@qq.com>
Date:   Thu Mar 28 13:38:36 2024 +0800

    fix doc

* feat: Add LlavaOneVision model to available models

chore: Update sqlitedict dependency to version 2.1.0

* Revert "Squashed commit of the following:"

This reverts commit 11b00999df3c43cb225482e030b791b2d454124c.

* Refactor available models in lmms_eval

Remove duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary in lmms_eval/models/__init__.py.

* fix: Handle import errors in lmms_eval models/__init__.py

The code changes in this commit fix the handling of import errors in the lmms_eval/models/__init__.py file. Previously, when an import error occurred, the code simply ignored it. This commit updates the code to log an error message using the logger module when an import error occurs.

This commit also removes duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary. A minimal sketch of this registry/import-error pattern appears after this squashed commit listing.

Recent user commits:
- Refactor available models in lmms_eval
- Revert "Squashed commit of the following:"
- feat: Add LlavaOneVision model to available models
- chore: Update sqlitedict dependency to version 2.1.0

* fix: Handle import errors in lmms_eval models/__init__.py

* chore: Remove unused imports in lmms_eval/models/__init__.py and lmms_eval/tasks/vcr_wiki/utils.py

* Remove unused imports in lmms_eval/tasks/vcr_wiki/utils.py

* chore: Update lmms_eval/tasks/vcr_wiki/utils.py

This commit updates the `lmms_eval/tasks/vcr_wiki/utils.py` file. It removes unused imports and fixes the condition for loading Spacy models based on the `load_package` value in the config file. Additionally, it adds a debug log message when the Spacy models are not loaded due to `load_package` being set to False. A conditional-loading sketch appears after this squashed commit listing.

Remove unused imports in `lmms_eval/tasks/vcr_wiki/utils.py`

* feat: Add new subtasks to overall score calculation

The code changes in this commit add new subtasks to the overall score calculation in the `overall_score` function. The subtasks "ScanQA", "BLINK", "MathVerse", "SciVerse", and "Mantis" are included in the `categories` dictionary. This ensures that the scores for these subtasks are calculated and included in the evaluation results. A sketch of this aggregation appears after this squashed commit listing.

Remove unused imports and update subtask categories in `utils.py`

* feat: Add new subtasks to overall score calculation

* chore: Update lmms_eval/tasks/llava_interleave_bench/_default_template_interleave_yaml

Update the image aspect ratio in the default template for the llava_interleave_bench task. Change the value of "image_aspect_ratio" from "original" to "pad". This ensures that the generated images have a padded aspect ratio.

* if no response directly return 0

* Squashed commit of the following:

commit b2a009b
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Mon Jul 15 19:12:25 2024 -0700

    if no response directly return 0 (#142)

commit 5fc5f2f
Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
Date:   Tue Jul 16 10:12:11 2024 +0800

    Add Muirbench (#143)

    * handle gen kwargs in internvl2

    * Add muirbench

* Add files via upload

(cherry picked from commit 557083a)

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: Yan Shu <570533048@qq.com>
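
The sketches below are illustrative only: they restate, in runnable form, a few of the changes described in the squashed commit message above and are not the repository's actual code.

First, a minimal sketch of the model-registry and import-error handling described above (`AVAILABLE_MODELS` in `lmms_eval/models/__init__.py`, a try/except around each model import, and an error logged instead of a silent failure). The `importlib` mechanics, entry format, and class names are assumptions for illustration:

```python
import importlib
import logging

logger = logging.getLogger(__name__)

# Registry mapping a model name to a "module:ClassName" entry.
# Each model appears exactly once (entries here are hypothetical).
AVAILABLE_MODELS = {
    "llava_hf": "llava_hf:LlavaHf",
    "llava_onevision": "llava_onevision:LlavaOneVision",
    "longva": "longva:LongVA",
}

for model_name, spec in AVAILABLE_MODELS.items():
    module_name, class_name = spec.split(":")
    try:
        # Import lazily so a missing optional dependency does not break
        # the whole package at import time.
        module = importlib.import_module(f"lmms_eval.models.{module_name}")
        globals()[class_name] = getattr(module, class_name)
    except ImportError as err:
        # Log the failed import instead of silently ignoring it.
        logger.error(f"Failed to import {class_name} from {module_name}: {err}")
```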
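
The vcr_wiki change described above, loading Spacy models only when `load_package` is enabled in the task config and emitting a debug message otherwise, suggests a guard of roughly this shape. The config structure and model names here are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical task config; in the real task this would come from the
# vcr_wiki YAML. "load_package" gates the optional Spacy dependency.
config = {"metadata": {"load_package": False}}

nlp_en = None
nlp_zh = None

if config["metadata"]["load_package"]:
    import spacy

    # Load the language models only when the task actually needs them.
    nlp_en = spacy.load("en_core_web_sm")
    nlp_zh = spacy.load("zh_core_web_sm")
else:
    logger.debug("load_package is False; skipping Spacy model loading for vcr_wiki.")
```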
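
For the `overall_score` update described above, only the subtask names ("ScanQA", "BLINK", "MathVerse", "SciVerse", "Mantis") come from the commit message; the grouping, scores, and averaging below are placeholders that merely show where new subtasks slot into a `categories` dictionary:

```python
# Hypothetical per-subtask scores; only the subtask names are taken from
# the commit message, the grouping and values are made up.
results = {"ScanQA": 55.0, "BLINK": 48.2, "MathVerse": 31.7, "SciVerse": 40.1, "Mantis": 62.3}

categories = {
    "3D/Spatial": ["ScanQA"],
    "Perception": ["BLINK", "Mantis"],
    "Math/Science": ["MathVerse", "SciVerse"],
}

def overall_score(results: dict, categories: dict) -> float:
    """Average the subtask scores of every category that has results."""
    per_category = []
    for name, subtasks in categories.items():
        scores = [results[t] for t in subtasks if t in results]
        if scores:
            per_category.append(sum(scores) / len(scores))
    return sum(per_category) / len(per_category) if per_category else 0.0

print(f"overall: {overall_score(results, categories):.2f}")
```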
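
Finally, the mmbench result-processing refactor quoted earlier in this log, where options "A" to "E" are added to each result record dynamically and default to "nan" when missing, boils down to a pattern like the following. Field names other than the option letters are assumptions:

```python
import string

def build_result_record(doc: dict, prediction: str) -> dict:
    """Attach options A-E to a result record, defaulting to "nan" when absent."""
    record = {
        "index": doc.get("index", "nan"),
        "question": doc.get("question", "nan"),
        "prediction": prediction,
    }
    # Add every multiple-choice option dynamically instead of listing each
    # key by hand; missing options fall back to "nan".
    for option in string.ascii_uppercase[:5]:  # "A" .. "E"
        record[option] = doc.get(option, "nan")
    return record

# Minimal usage example with a document that lacks option "E".
doc = {"index": 0, "question": "What is shown?", "A": "a cat", "B": "a dog", "C": "a bird", "D": "a fish"}
print(build_result_record(doc, prediction="A"))
```
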
kcz358 added a commit that referenced this pull request Sep 5, 2024
* fix doc

* [WIP] adding mmbench dev evaluation (#75)

* WIP

* Update GPT evaluation model name and sys prompt

* 🛠️ Scale accuracy to percentage

The accuracy value is now multiplied by 100 in the aggregation function to represent it as a percentage. Regarding the evaluation process, `math` module importation and refactoring reduce progress log verbosity by logging every 100 evaluations instead of 10. It prevents potential logging overflow. Handling of NaN values is added to ensure 'default_value' is set in case of missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintenance.

Issue refs: #1427, #1533

* Update GPT evaluation model name and API configuration

* Refactor MMBench_Evaluator class to handle missing columns

* Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations

* Refactor MMBench-CN and MMBench-EN evaluation functions

* 🔄 Refactor result processing and logging logic

- Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
- Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
- In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
- Commented-out code and verbose logging during evaluation, which may have interfered with performance, has been removed for a more efficient and less intrusive logging experience.

This cleanup reduces redundancy in the codebase and improves evaluation performance.

Refs #2045

---------

Co-authored-by: Bo Li <bo.li01@bytedance.com>
(cherry picked from commit a19278c2ea6ddcbca64d3cc7f4efec7fe5775121)

* Create README.md

* Add files via upload

* Add MathVerse

* Fix typo in qwen_vl that was causing "reference before assignment"

* convert contexts to list if necessary and remove unnecessary construction of `questions`

* refactor query construction for clarity

* Create ScreenSpot on clean branch

* Update README to reflect new tasks

* Add README file specific to ScreenSpot

* slight update

* Init webSRC

* Draft README for WebSRC

* Update main README with new task names

* Draft and validate websrc eval on dev split

* Add code to enable compilation of submission for WebSRC test split

* Bugfix: WebSRC should be token-level F1 NOT character-level

* Add qwen vl api

* Fix llava conv template for llama3

* Fix llava_hf generation for 1.6

* Parse result for llava_hf 1.6

* Add model_name parameter to Llava constructor

* Fix endless warning for llava_hf generation

* Fix llava_hf image tokens number issue

* Create LICENSE

* Update LICENSE

* Update LICENSE

* Better task list_with_num

* Fix idefics2 llava in the wild bugs

* Remove redundant code in fuyu

* Fix instructblip qformer size mismatch and multi-images problem

* Comment out parse result in xcomposer

* Comment out Spice in caption task so that don't need to download stanford nlp model

* Update gitignore

* Add separated pope tasks by category

* Fix pope random name in pope full

* Set printing info for llava_hf to debug level

* Adding Phi3v model.

* Adding prompt arguments for Phi3v on MathVista-TestMini

* Adding documentation of Phi3v class.

* [Fix] import issues of multilingual llava and olympiadbench

* fix compatibility issue of older version llava

* add upd

* add upd

* add upd

* add upd

* add upd

* add upd

* Group MMMU images into one image (#83)

* update

* update font

* Add matplotlib.font_manager import in utils.py

* Refactor font handling in add_order_label function in utils.py

* group mmmu

---------

Co-authored-by: Li Bo <drluodian@gmail.com>

* merge model_specific_prompt_kwargs and dataset_name into each task yaml

* Add MathVerse in README.md

* slightly change query_prompt for the reproduction

* update utils.py for leaderboard submission

* add conbench

* update README

* Update README.md

* init include vcr

* modify the form of VCR

* switch logic

* add crossed_text to vcr_wiki output

* include the try-except logic for spacy

* update vcr_wiki tasks

* update vcr_wiki tasks in README.md

* include std and confidence interval

* update gpt-3.5-turbo version

* update gpt-3.5-turbo version

* chore: Remove unnecessary files and code related to live_bench and sft_eval tasks

* Bump version to 0.2.0.dev0

* chore: Update lmms-eval to support video evaluations for LLaVA models

* Update llava conv_template in lmms_eval/models/llava.py

* Update image alignment in README.md

* chore: Update lmms-eval to support video evaluations for LLaVA models

* chore: Update lmms-eval to support video evaluations for LLaVA models

* Update README.md

* Update README.md

* update aggregation function for vcr_wiki

* update README.md

* Update README.md

* update version

* add II-Bench

* fix dataset_path

* Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image

* add tinyllava

* LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

* fix #117, allow auto download with tar format videos

* fix #117, allow auto download with tar format videos

* fix typo

* feat: Add support for auto downloading tar format videos

* Release llava-wilder

* chore: Update dependencies to fix potential risks and improve compatibility

* tutorial

* docs

* update preparation

* small fix

* small fix

* lint

* to sh script

* update readme

* Remove handling non-visual loop in llava

* Add llava_hf back to registry

* Update README.md

* Update README.md

* update ablation for videomme datasets

* chore: Handle ImportError when importing models

Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.

* chore: Remove unused models from lmms_eval package

* feat: Allow loading model configurations from other packages

* feat: Allow including external tasks from plugins

* chore: Add loguru for logging in lmms_eval package

* Remove unnecessary lines since use batched visuals now in llava

* Add longva

* Revise model registry for llava_hf and longva

* Delete unnecessary lines

* Remove unnecessary lines for video llava

* Update pyproject.toml

* Update activitynetqa_generation.yaml

* Fix vid mme post prompt issue

* new task gqa-ru

* add mmbench_ru_dev

* change prompt to ru

* create new task vitatecs

* Update README.md

* Add wild vision 0617

* Hardcode to keep image for wild vision

* Fixing scoring logic

* Fixing dataset name

* Fixing handling None filtered score

* Add detailcaps

* Add install capture_metric in env

* Add files via upload

* feat: Add tie_weights parameter to Llava model initialization

* Upgrade lmms-eval to support more models and evaluation tasks

* Upgrade lmms-eval to version 0.2.1

* Rename xcomposer 4KHD

* chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

* Update utils.py

* Update _default_template_vcr_yaml

* add process sync via temp file in lmms_eval/evaluator.py

* Update utils.py

* Update _default_template_vcr_yaml

* Add muirbench

* Squashed commit of the following:

commit dfdba507b5fbe985b0030ffec575f9f2638bc1ed
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 16 11:13:52 2024 +0800

    merge ov evals (#144)

    * chore: Update gpt_eval_model_name to "gpt-3.5-turbo" in mathvista.yaml

    * Squashed commit of the following:

    commit 994c9f97a2f8db3e9b7d7933d1e1680acde5b70b
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 17:21:23 2024 +0800

        Add files via upload

    * Squashed commit of the following:

    commit e31cd7883d4555c7530795c7f102b8d78cbd372f
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jul 10 12:08:08 2024 +1000

        chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

    commit 1d8c980d1089f9d7702c3b92d5c85039f2809c6d
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jul 9 02:08:52 2024 +0000

        Rename xcomposer 4KHD

    commit 6da76f36ecf5f9aa73057e767a4fcb60c99ff896
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:55:56 2024 +1000

        Upgrade lmms-eval to version 0.2.1

    commit cd1858523fcd8630082cbefba8710e0de3ee8805
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:52:23 2024 +1000

        Upgrade lmms-eval to support more models and evaluation tasks

    commit 672d7e5bb49dcb34e1b2fdeb09f3f4588dc583a6
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:43:41 2024 +1000

        feat: Add tie_weights parameter to Llava model initialization

    commit 2037a86261b55fa42b8ba3a04eab192b3e69d6ea
    Merge: e6844db1 a5c18692
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:37:12 2024 +1000

        Fix gen kwargs image aspect ratio in internvl2

    commit a5c186925de989b616f58a35ece36065a32b4594
    Merge: 2ebec77f 557083a1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 9 09:15:56 2024 +0800

        Merge pull request #137 from shuyansy/main

        add MLVU task

    commit 557083a156c3dd67ac79e22b4202e9b69b6b00f4
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 16:56:50 2024 +0800

        Add files via upload

    commit 2ebec77f5606d79e9a7b995970e32792050606a1
    Merge: 211bfede b23d349e
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 8 11:53:06 2024 +0800

        Merge pull request #136 from Dousia/main

        Add detailcaps

    commit b23d349e46d60dc149ffaa54d6e019f4996ed92d
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:24:19 2024 +0800

        Add install capture_metric in env

    commit c6e211d5f9dbb7572d3a141b6504cb1ca2007c33
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:04:13 2024 +0800

        Add detailcaps

    commit 211bfedebad243ef82a8b0be36c3b5a9b9cb2f72
    Merge: 7c208b76 79514eee
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 2 23:05:12 2024 +0800

        Merge pull request #133 from EvolvingLMMs-Lab/dev/wild_vision

        Add wild vision bench

    commit 79514eeebcfd6f655be2a10c776037d12a7b7214
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 15:10:02 2024 +0000

        Fixing handling None filtered score

    commit 725fac2781446958b905e1e6c6eb3c0a8e582e49
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:25:42 2024 +0000

        Fixing dataset name

    commit 8d963e132ac03fc0d835d480cfcfcabe72af143c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:24:51 2024 +0000

        Fixing scoring logic

    commit e2990d0a69e876721256fdf946c68ba7ae0cbdc1
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:57 2024 +0000

        Hardcode to keep image for wild vision

    commit ed381736730d8fb785b4ee919fdb751734ecef25
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:38 2024 +0000

        Add wild vision 0617

    commit 7c208b76640c986cfe94233dce735c3ca4ad4319
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:53:31 2024 +0800

        Update README.md

    commit 39d40dea47bc59ff04e8b0cbc445345098debc9a
    Merge: e19b43a3 ba7081c0
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:47:09 2024 +0800

        Merge pull request #129 from Dannoopsy/mmbench_ru

        add task MMBench-ru

    commit e19b43a3a1e7212e623061b164b0419cc0dda689
    Merge: 11fd7e3f a0de8970
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:58 2024 +0800

        Merge pull request #128 from Dannoopsy/gqa-ru

        add task gqa-ru

    commit 11fd7e3fc05908aeb01e4a6161a7b55cd38b3122
    Merge: 383e7fea a7522592
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:16 2024 +0800

        Merge pull request #130 from lscpku/vitatecs

        Add task VITATECS

    commit a75225926e5954f85466d257f99acf0163fde596
    Author: lscpku <lisc99@pku.edu.cn>
    Date:   Fri Jun 28 20:37:06 2024 +0800

        create new task vitatecs

    commit ba7081c0abac840002d320e30733e891298dfa11
    Author: Dannoopsy <63581325+Dannoopsy@users.noreply.github.com>
    Date:   Fri Jun 28 12:21:05 2024 +0300

        change prompt to ru

    commit 27ea9c0055a8abf3a8198829b8617018479918e2
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Thu Jun 27 17:17:29 2024 +0000

        add mmbench_ru_dev

    commit 383e7fead3138aedf62e9c0ec48303835ef26e2a
    Merge: 06fa000f ed2e7f79
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri Jun 28 00:14:10 2024 +0800

        Merge pull request #126 from lorenzomammana/feature/external-package-integration

        External package integration using plugins

    commit ed2e7f792151d21bce8f1c498270b9391e1d5c85
    Merge: 03947e14 06fa000f
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Thu Jun 27 15:38:10 2024 +0000

        Merge branch 'main' into feature/external-package-integration

    commit a0de89708d5e6f259bb17f0eaace3c5b901b275c
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Tue Jun 25 11:11:37 2024 +0000

        new task gqa-ru

    commit 06fa000f60d3e4d160fac8ceb9959ae92a98f752
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jun 25 06:41:13 2024 +0000

        Fix vid mme post prompt issue

    commit b388d79e0df6f60068196cb7047453ebd22d6ef1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 22:31:16 2024 +0800

        Update activitynetqa_generation.yaml

    commit 8f9d620fcd9d0a0742ee6bcf51ea63bd6b088a36
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:25 2024 +0800

        Update pyproject.toml

    commit 6341b7c15ce9fb28eb06b067ddb299d6cf2e16c3
    Merge: fce85f1b 903b042b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:02 2024 +0800

        Merge pull request #125 from EvolvingLMMs-Lab/dev/interleave

        [Model] aligned llava-interleave model results on video tasks

    commit 903b042be016016d4ebeecb07701f3076a2d323c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 12:07:13 2024 +0000

        Remove unnecessary lines for video llava

    commit d78ec86407b729a964906a8c2e50704b4bc74d06
    Merge: ebe7217a fce85f1b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Jun 22 13:57:31 2024 +0800

        Merge branch 'main' into dev/interleave

    commit ebe7217a486c1e754e42c2cbdb834e09fbbcc9b0
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 02:57:08 2024 +0000

        Delete unnecessary lines

    commit 120c474b056f9177c74e1fd9691d59e2f234b785
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:38:41 2024 +0000

        Revise model registry for llava_hf and longva

    commit 7d6201f921088afd3f52a35076e3c6fcc9aa518c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:38:24 2024 +0000

        Add longva

    commit 12f480699c71a12a24d4349d9b0681933201a3a6
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:35:39 2024 +0000

        Remove unnecessary lines since use batched visuals now in llava

    commit 12cea76f1f0f14b1fd1007c9d39a9b0557368637
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 18:15:32 2024 +0000

        chore: Add loguru for logging in lmms_eval package

    commit 03947e14a46fd25b412931f7c9c25f4a2971d0b4
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Wed Jun 5 13:40:41 2024 +0000

        feat: Allow including external tasks from plugins

    commit b80a91f73e15ddd0b0ce1322d7d121fa14030eed
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Wed Jun 5 13:04:55 2024 +0000

        feat: Allow loading model configurations from other packages

    commit 8ef24740dd48a11c97eb627f2fff4aca107fef0d
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 12:11:03 2024 +0000

        chore: Remove unused models from lmms_eval package

    commit af38885fc2e066f5ea44388f33e07176f836fe28
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 12:07:09 2024 +0000

        chore: Handle ImportError when importing models

        Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.

    commit fce85f1b03ff7043b29dee787c5d17a08dd2687a
    Merge: dbe63293 d94f83cb
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 20:02:12 2024 +0800

        Merge pull request #120 from EvolvingLMMs-Lab/pufanyi/hf_dataset_docs

        Add docs for datasets upload to HF

    commit dbe63293245a5141fdfd80bda7657c304f6bd32f
    Author: choiszt <ls2001927@sohu.com>
    Date:   Thu Jun 20 15:14:21 2024 +0800

        update ablation for videomme datasets

    commit d94f83cb3f08b61a2c75cc4326e58792100605b3
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 13:30:59 2024 +0800

        Update README.md

    commit cab8159ff35db330536c0b6dfb4b0a3b24142209
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 13:30:29 2024 +0800

        Update README.md

    commit 45876652a877a8006b828f32f5cc4660629f9190
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu Jun 20 03:55:30 2024 +0000

        Add llava_hf back to registry

    commit 3463651b8c54d36cd94169e3d376f5ed225a195a
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu Jun 20 03:54:33 2024 +0000

        Remove handling non-visual loop in llava

    commit cb0d3f49b72790b081f981e0e6147131542f7f68
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Thu Jun 20 02:11:18 2024 +0800

        update readme

    commit 813877bfe5ac590cdbe92dd74d18f83a2091f748
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:37:52 2024 +0800

        to sh script

    commit a14684b8557d5894976448a5c559ed7a66a6cf16
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:37:04 2024 +0800

        lint

    commit d0f8851d42ba31f5da2a7a65e91499db45174dbc
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:36:48 2024 +0800

        small fix

    commit 63748e9718f287ad433afc90e340b5e17a89c1ed
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:36:43 2024 +0800

        small fix

    commit 7f1159a1fe04cfb783dc31d4fbdef3bda0ce19e4
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:35:05 2024 +0800

        update preparation

    commit 19f9bd621c76a483ff98f8c7eb78f64753da683a
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:23:24 2024 +0800

        docs

    commit ce6f889ba02d819979c7922f6336cf4f1f718f65
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 15:04:16 2024 +0800

        tutorial

    commit f513c520c2a3dad26d2b2ca5c4ed4db05a493c73
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 19 06:51:19 2024 +0000

        chore: Update dependencies to fix potential risks and improve compatibility

    commit efb529552c5e4ba039a4cba8e9aa5cb7ba65bf90
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Wed Jun 19 10:25:58 2024 +0800

        Release llava-wilder

    commit 742651fc9daf97e2f57831ed6e6e7ee7ead7d555
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Wed Jun 19 07:44:26 2024 +0800

        feat: Add support for auto downloading tar format videos

    commit 511b6259828212fcba954cdeb8cf90d6e5daabf8
    Merge: 22a4958e 050b2c37
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jun 18 17:01:03 2024 +0000

        Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

    commit 050b2c370017e9b97475dd6cf01fd051b5ca5c86
    Merge: 74facb41 ef306512
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jun 18 13:13:38 2024 +0800

        Merge pull request #114 from zjysteven/add-tinyllava

        add tinyllava

    commit ef306512e5135f76dffa383f600b8733015836e8
    Author: Jingyang Zhang <jingyang.zhang@duke.edu>
    Date:   Mon Jun 17 17:57:02 2024 -0400

        fix typo

    commit 9bab67732a4238097725deddf867fb1946ffee40
    Merge: dbfb2387 74facb41
    Author: Jingyang Zhang <jingyang.zhang@duke.edu>
    Date:   Sun Jun 16 10:56:05 2024 -0400

        Merge branch 'EvolvingLMMs-Lab:main' into add-tinyllava

    commit 74facb41a826691dfce4458cf1d8659b34fc5bf5
    Merge: 8ba192f9 d5df72de
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 16 17:59:19 2024 +0800

        Merge pull request #118 from teowu/main

        Fix the potential risk by PR #117

    commit d5df72de2d03108d6b365818ecc3551ac9aa6302
    Merge: 5bf59ed2 8ba192f9
    Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
    Date:   Sun Jun 16 15:32:13 2024 +0800

        Merge branch 'EvolvingLMMs-Lab:main' into main

    commit 5bf59ed250da98a408a94e214a73caa400cba842
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Sun Jun 16 07:27:28 2024 +0000

        fix #117, allow auto download with tar format videos

    commit 98b3955cb808e36303c030aea78eb037d1ec59ce
    Merge: a056f118 be9dada8
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Sun Jun 16 07:25:07 2024 +0000

        Merge branch 'main' of https://github.com/teowu/lmms-eval into main

    commit a056f118704eccec86ce32ab86981ce4bc1e1deb
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Sun Jun 16 07:23:54 2024 +0000

        fix #117, allow auto download with tar format videos

    commit 8ba192f94edf5d99598983445d5faa4f8807c49f
    Merge: 7cc28907 be9dada8
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Jun 15 17:30:59 2024 +0800

        Merge pull request #117 from teowu/main

        LongVideoBench for LMMs-Eval

    commit be9dada8b4189c53c08e1674ab273242cf2f80a0
    Merge: 62ea8ceb 7cc28907
    Author: Teo (Timothy) Wu Haoning <38696372+teowu@users.noreply.github.com>
    Date:   Sat Jun 15 16:39:20 2024 +0800

        Merge pull request #1 from EvolvingLMMs-Lab/main

        Merge pull request #113 from teowu/main

    commit 62ea8ceb223ef2b51ebab2bcd50d5cf339c35cfe
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Sat Jun 15 08:30:11 2024 +0000

        LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)

    commit 7cc28907edbb4eb58ee1398772a48110ea35dd96
    Merge: 4bc7224d ea14cd4b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Jun 15 14:10:22 2024 +0800

        Merge pull request #113 from teowu/main

        Q-Bench, Q-Bench2, A-Bench

    commit dbfb23873979f789477f4797ee2d6071e0fd921e
    Author: Jingyang <jingyang.zhang@duke.edu>
    Date:   Fri Jun 14 16:20:42 2024 -0400

        add tinyllava

    commit ea14cd4b361f4c95b3665cbdb95bc51754090eb5
    Author: teowu <realtimothyhwu@gmail.com>
    Date:   Fri Jun 14 15:01:52 2024 +0000

        Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image

    commit 4bc7224dcd27fe8b288bfc3fed4d7a9da9635658
    Merge: 2797987f bf14cb85
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri Jun 14 02:14:43 2024 +0800

        Merge pull request #111 from XinrunDu/main

        add II-Bench

    commit bf14cb8527b2b7ac438a36567a875168bc02d294
    Author: XinrunDu <duxinrun2000@gmail.com>
    Date:   Thu Jun 13 09:37:02 2024 +0000

        fix dataset_path

    commit 6248113f4e11a0ac396d31fa1b032a142fea8cb4
    Author: XinrunDu <duxinrun2000@gmail.com>
    Date:   Thu Jun 13 09:32:06 2024 +0000

        add II-Bench

    commit 2797987f5b88b87bd172714b678a75a1d8051826
    Merge: 63d82f1f 66d4bb2d
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 11:14:47 2024 +0800

        Merge pull request #109 from EvolvingLMMs-Lab/pufanyi/update_version

        [Small Update] Update the version of LMMs-Eval

    commit 66d4bb2d9c9afbbdea40196d4ad80e214d0b14b6
    Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Date:   Thu Jun 13 11:13:00 2024 +0800

        update version

    commit 63d82f1ff11eb430d91a15d6788a1f0b4d596850
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 11:04:32 2024 +0800

        Update README.md

    commit 44a33799671cb668f55366d5e5a4ddb051a3a1b4
    Merge: 5ed00356 0ce46d08
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 04:00:12 2024 +0800

        Merge pull request #105 from tianyu-z/main

        Include VCR

    commit 0ce46d088e473d12d63de44f17c67dceab25658c
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Wed Jun 12 15:56:34 2024 -0400

        update README.md

    commit 46a88d8b0199ed44d2ff459fb372f2e006960cea
    Merge: 47b13b9b 5ed00356
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Wed Jun 12 15:50:26 2024 -0400

        merged readme.md

    commit 47b13b9b320d36ac53b3622557e31239f7c22621
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Wed Jun 12 15:30:52 2024 -0400

        update aggregation function for vcr_wiki

    commit 5ed00356676cf5d0ff056cf27d1b519b8e303ff7
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 03:21:42 2024 +0800

        Update README.md

    commit ed8806839db5988ced672bd162b7b046edb4863a
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 03:13:59 2024 +0800

        Update README.md

    commit fea3806026932a6e2bd6e538bcc413e33abdf245
    Merge: d99a24ab 05dc8e85
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 13 03:11:49 2024 +0800

        Merge pull request #108 from EvolvingLMMs-Lab/internal_main_dev

        [Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval

    commit 05dc8e853eab7c6bc782a1e2662d2efe7422f767
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:56:04 2024 +0000

        chore: Update lmms-eval to support video evaluations for LLaVA models

    commit cbeee20bc4ffb510a2b23d96cdaf4077be7c2a9e
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:50:30 2024 +0000

        chore: Update lmms-eval to support video evaluations for LLaVA models

    commit f00d5498b69dd4f7e54c907ac906abc7c128f000
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:46:33 2024 +0000

        Update image alignment in README.md

    commit 34156335db74cef9e3f0915d7172fd6b22456c15
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:43:16 2024 +0000

        Update llava conv_template in lmms_eval/models/llava.py

    commit 50575a950736bc8fc1e191310314cbb5fdff5720
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:39:03 2024 +0000

        chore: Update lmms-eval to support video evaluations for LLaVA models

    commit c9b2252fb8a15dd04252af5e6b4613855afd6ada
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:33:48 2024 +0000

        Bump version to 0.2.0.dev0

    commit 465bd4205e8097e9c037b24a3ed08dd6a7694efa
    Merge: e43bd840 d99a24ab
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 15:04:25 2024 +0000

        Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval into internal_main_dev

    commit e43bd840b63eb499856e36d9d2ba45c924abcead
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jun 12 14:54:06 2024 +0000

        chore: Remove unnecessary files and code related to live_bench and sft_eval tasks

    commit d99a24abd06df10d07e5a4d0ad5030613f92f2e7
    Merge: 374590be a66003be
    Author: Li Bo <drluodian@gmail.com>
    Date:   Wed Jun 12 19:45:57 2024 +0800

        Merge pull request #107 from AtsuMiyai/new_task/upd_update

        update gpt-3.5-turbo version

    commit a66003befe4175824a1be6ed59f5f5b88c15f792
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed Jun 12 17:05:17 2024 +0900

        update gpt-3.5-turbo version

    commit ee91f272985f32eeb9cd6faa41afdd8eb49cac30
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed Jun 12 16:50:53 2024 +0900

        update gpt-3.5-turbo version

    commit 326b9694fc77398592b8caf3ba0bc2e2bb903813
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 20:07:40 2024 -0400

        include std and confidence interval

    commit cd050d4a721d01a2ace0cd030cf7f8dc67eb8c4d
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Mon Jun 10 18:49:47 2024 -0400

        update vcr_wiki tasks in README.md

    commit 205721e0aad76dde30255e56149bbed121883356
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Mon Jun 10 18:43:15 2024 -0400

        update vcr_wiki tasks

    commit db8e718b502469e8536ee359c5559de87635ffc7
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 16:13:58 2024 -0400

        include the try-except logic for spacy

    commit 427dabb790118f538b64e4e5bf6a7aab9689b3d9
    Author: Suyuchen <suyuchen.wang@umontreal.ca>
    Date:   Mon Jun 10 15:51:05 2024 -0400

        add crossed_text to vcr_wiki output

    commit 043b483eb55f7be4fea75c9bc0b9b03d251b109b
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 15:47:00 2024 -0400

        switch logic

    commit e1f04db8f58dd10591fde335ea13f74cda7c79bd
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 02:38:21 2024 -0400

        modify the form of VCR

    commit 96e8d9867c9549ab7490f4b12cfeb6a06238e0aa
    Author: tianyu-z <zhangtianyupro@gmail.com>
    Date:   Mon Jun 10 00:10:30 2024 -0400

        init include vcr

    commit 374590be62f988a76cf6704cfe394cd8ae7d4cb6
    Merge: 504685e2 cb3b9ce7
    Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
    Date:   Fri Jun 7 20:25:48 2024 +0800

        Merge pull request #101 from Gumpest/main

        Update conbench in README

    commit 504685e20b17659b913cf46f3012c16bf429e09d
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 6 15:42:15 2024 +0800

        Update README.md

    commit cb3b9ce71411da862ff01342a9122a3c656ffbd1
    Merge: c9793b38 67b64ea4
    Author: Yuan Zhang <56063339+Gumpest@users.noreply.github.com>
    Date:   Thu Jun 6 11:22:24 2024 +0800

        Merge branch 'EvolvingLMMs-Lab:main' into main

    commit c9793b3883714f254a700230b7bee781d6110e73
    Author: Yuan Zhang <gump_well_done@163.com>
    Date:   Thu Jun 6 11:21:05 2024 +0800

        update README

    commit 67b64ea44a5a39d96c7a196a8a8345a7486bd912
    Merge: 8ee7848a 5fd68451
    Author: Li Bo <drluodian@gmail.com>
    Date:   Wed Jun 5 23:12:58 2024 +0800

        Merge pull request #100 from Gumpest/main

        add Conbench

    commit 5fd684515c55ef643726c1b6c720c7cbd2183ba1
    Author: Yuan Zhang <gump_well_done@163.com>
    Date:   Wed Jun 5 21:52:31 2024 +0800

        add conbench

    commit 8ee7848aaa6383aa1f919c3f21199c81db3fff89
    Merge: 747e1978 6fefaf7c
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jun 4 17:09:33 2024 +0800

        Merge pull request #95 from AtsuMiyai/new_task/upd

        add MM-UPD

    commit 747e19782996065cdce7157ee8c5e15beb5b6c59
    Merge: 4854a34d 05843072
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jun 4 17:09:04 2024 +0800

        Merge pull request #97 from CaraJ7/update

        Add MathVerse in README.md

    commit 6fefaf7cea504e35583ee7217449da290295a7a4
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Tue Jun 4 17:36:39 2024 +0900

        update utils.py for leaderboard submission

    commit 5f4fe360def1c48ea0cb1da6409d192784882308
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Sun Jun 2 23:28:27 2024 +0900

        slightly change query_prompt for the reproduction

    commit 05843072d608b970bcada1cd0db65a3c80864060
    Author: CaraJ7 <1350074492@qq.com>
    Date:   Sun Jun 2 17:05:28 2024 +0800

        Add MathVerse in README.md

    commit 0581ab3cfb362e2024988b46fbbb00324f1233c9
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Fri May 31 16:09:45 2024 +0900

        merge model_specific_prompt_kwargs and dataset_name into each task yaml

    commit 4854a34d4d37efb5e201f2691ecdb054590cf20b
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Sat May 4 19:23:39 2024 +0800

        Group MMMU images into one image (#83)

        * update

        * update font

        * Add matplotlib.font_manager import in utils.py

        * Refactor font handling in add_order_label function in utils.py

        * group mmmu

        ---------

        Co-authored-by: Li Bo <drluodian@gmail.com>

    commit d224794c49520f4d28a31862cf977198cd6cbc5e
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 15:15:59 2024 +0900

        add upd

    commit 453e7936424220f02b99517059ca71babfbe5f5a
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 15:03:30 2024 +0900

        add upd

    commit 909edd6769ddcf8a546be4fdd129416687516878
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 12:52:21 2024 +0900

        add upd

    commit 7c1ac9706cafc4801fa4da181d2f610b7838c7b8
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 12:50:32 2024 +0900

        add upd

    commit 811301c5280ddd74986645086f026ab730c8848c
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 12:46:58 2024 +0900

        add upd

    commit 71401bafd1d515f704f86ab4817a758542bc4672
    Author: AtsuMiyai <miyai.atsuyuki.practice@gmail.com>
    Date:   Wed May 29 12:41:21 2024 +0900

        add upd

    commit 24dc435908d921e9f1a5706e3141b12e5d838d18
    Author: Bo Li <drluodian@gmail.com>
    Date:   Mon May 27 10:17:32 2024 +0000

        fix compatibility issue of older version llava

    commit 616edf43731415b35f0f5e97748ed2e017a2891d
    Author: Bo Li <drluodian@gmail.com>
    Date:   Mon May 27 09:32:26 2024 +0000

        [Fix] import issues of multilingual llava and olympiadbench

    commit 4c5a99e21a63fb0ee1c7d15546d18066e1d9894b
    Merge: 45c05b2b b05c3e22
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon May 27 14:19:53 2024 +0800

        Merge pull request #87 from vfragoso/vifragos/phi3v

        Adding microsoft/Phi-3-vision-128k-instruct model.

    commit b05c3e222fabd308dd7af4e04c1c6a0812962fe6
    Author: Victor Fragoso <victor.fragoso@microsoft.com>
    Date:   Fri May 24 16:36:37 2024 +0000

        Adding documentation of Phi3v class.

    commit c2008971308ce8168d57c24d00b725832f099244
    Author: Victor Fragoso <victor.fragoso@microsoft.com>
    Date:   Fri May 24 16:25:02 2024 +0000

        Adding prompt arguments for Phi3v on MathVista-TestMini

    commit 7f9fb6bcc6cd24a7b8011b8753d0ea98cc2451fd
    Author: Victor Fragoso <victor.fragoso@microsoft.com>
    Date:   Fri May 24 13:24:16 2024 +0000

        Adding Phi3v model.

    commit 45c05b2b2bece76e06849a52a0d034f9c0ac2367
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 23 03:47:36 2024 +0000

        Set printing info for llava_hf to debug level

    commit 53f013ed8278776551ca992562253387cc9968d2
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 23 03:41:39 2024 +0000

        Fix pope random name in pope full

    commit 22520a95f13334b75eee0cf0387151067a6bf516
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 23 03:41:14 2024 +0000

        Add separated pope tasks by category

    commit d1eefb1565014b47287ffa6b350229062f8f602f
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 9 08:36:02 2024 +0000

        Update gitignore

    commit b2b4dbd2dc13432c79208db35abf7f55c97f1790
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon May 20 07:45:11 2024 +0000

        Comment out Spice in caption task so that we don't need to download the Stanford NLP model

    commit 662f05ce4c62a46a83f819d3a5925a9bd20059b5
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon May 20 03:13:13 2024 +0000

        Comment out parse result in xcomposer

    commit 09329322916bfbb604d72ddaf50441a0947f8805
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 16 03:55:39 2024 +0000

        Fix instructblip qformer size mismatch and multi-images problem

    commit 557a6a3b15e07e506bc05e2cc76ff6a2f8c93964
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 16 03:11:41 2024 +0000

        Remove redundant code in fuyu

    commit 6aeb5504e74ed1980b53700d8e4d4dcf7d1b38fc
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 16 01:45:24 2024 +0000

        Fix idefics2 llava in the wild bugs

    commit aea80e6a71f716951353e1e5d68380243396b4d6
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Wed May 15 11:07:35 2024 +0000

        Better task list_with_num

    commit 3c12a080d66b9c38f615b961befca7c30f82fa39
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat May 18 02:35:52 2024 +0800

        Update LICENSE

    commit 82317a635a4978b32e095a06cc295d0ae23661c2
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat May 18 02:29:09 2024 +0800

        Update LICENSE

    commit a8bba1cdb51061a0d27bf9a98cca1505b5c58ea5
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat May 18 02:28:03 2024 +0800

        Create LICENSE

    commit caa5893b5fd2c1d32c72b97f371ccd9a8d9ec3a0
    Merge: c0944486 423b0060
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon May 13 11:45:26 2024 +0800

        Merge pull request #73 from EvolvingLMMs-Lab/kc/qwen_vl_api

        [Feat] Add qwen vl api

    commit c09444860362a136f17641f8b2a1f91c2bbc3715
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat May 11 06:11:19 2024 +0000

        Fix llava_hf image tokens number issue

    commit 64f07e497f53e5bcbe9e8fb5830cc7a1daaf7ff1
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Thu May 9 02:04:10 2024 +0000

        Fix endless warning for llava_hf generation

    commit 8aaa828108da8514dd9cd23a9d6d83a8b67f2d65
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu May 2 06:13:56 2024 +0000

        Add model_name parameter to Llava constructor

    commit 7847dc4d8efe60605102414bb071b1da9851228e
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue May 7 03:15:59 2024 +0000

        Parse result for llava_hf 1.6

    commit 3e56b4f92db39a2ce92903b0c43a34f1d14d59ec
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue May 7 03:09:56 2024 +0000

        Fix llava_hf generation for 1.6

    commit fa3ff92b07ea5aaa633a2039818c310744f84d07
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon May 6 08:32:57 2024 +0000

        Fix llava conv template for llama3

    commit 423b00606aa77fd6b324c19e3d480b73ab852db6
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sun May 5 07:54:52 2024 +0000

        Add qwen vl api

    commit b7fd7a9f7aa3c0e1e50374047dfffc46a7462b90
    Merge: 986139a9 c5a130b6
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun May 5 13:19:48 2024 +0800

        Merge pull request #59 from EvolvingLMMs-Lab/add_idefics2

        add idefics2

    commit 986139a9a31154679bdea029b09639f84712db27
    Merge: b46239ca 8d3526c0
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri May 3 01:18:18 2024 +0800

        Merge pull request #36 from cocoshe/main

        [Fix] repr llava doc

    commit b46239cabab7b545ec99d9eae6c851e531b18374
    Merge: bc69a744 373265f2
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri May 3 01:17:34 2024 +0800

        Merge pull request #56 from gagan3012/main

        Multilingual LLava bench

    commit bc69a744d2cffeb06eba62e843bcc7869e27613a
    Merge: eef3aeb6 626e8a91
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri May 3 01:12:14 2024 +0800

        Merge pull request #70 from hunterheiden/hsh/new_task/WebSRC

        Bugfix: WebSRC should be token-level F1 NOT character-level

    commit 626e8a91a4af2dd5dd774fc130cc2f4d74b2bc37
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Thu May 2 09:31:03 2024 -0400

        Bugfix: WebSRC should be token-level F1 NOT character-level

    commit eef3aeb6ab589bb1d5045af5b5c1984a69402d19
    Merge: c4e9dd9f 9bca4413
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu May 2 14:38:17 2024 +0800

        Merge pull request #69 from hunterheiden/hsh/new_task/WebSRC

        [New Task] WebSRC (multimodal Q&A on web screenshots)

    commit 9bca441376325173128e5c50087f068e519c48da
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed May 1 11:07:29 2024 -0400

        Add code to enable compilation of submission for WebSRC test split

    commit 7687495b1ed552eeba088cb9ad5aaf1170e7fff9
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed May 1 10:47:32 2024 -0400

        Draft and validate websrc eval on dev split

    commit 4eebd3e5d7ab3b8c3116eea57318db72d2ce32bb
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed May 1 10:46:54 2024 -0400

        Update main README with new task names

    commit 35fe80b67656114a8824eb59574089663bdc4c9a
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed May 1 10:46:20 2024 -0400

        Draft README for WebSRC

    commit 955bd0635cc6c14a96ad869f1002e6dbefdc5071
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Tue Apr 30 10:16:21 2024 -0400

        Init webSRC

    commit c4e9dd9f6e40e8586587c4a75987aa109a37f14b
    Merge: d8a3a99f 319afccb
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri Apr 26 14:37:22 2024 +0800

        Merge pull request #63 from hunterheiden/hsh/new_task/screenspot

        New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens

    commit 319afccbe713ddf40a8a6fa28501e64c0ad34725
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Thu Apr 25 11:44:34 2024 -0400

        slight update

    commit 2f3811ca1bbad6a441016b05fde09a571900fca8
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Thu Apr 25 11:41:04 2024 -0400

        Add README file specific to ScreenSpot

    commit 28962cbe83631ec5d6481aaea4907a7c96fec848
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Wed Apr 24 11:52:33 2024 -0400

        Update README to reflect new tasks

    commit e457cfb4f2d6869e8367d6d5b03ad25ee4acc363
    Author: Hunter Heidenreich <hunter.heidenreich@rootsautomation.com>
    Date:   Tue Apr 23 18:33:16 2024 -0400

        Create ScreenSpot on clean branch

    commit d8a3a99ff6142fe101fa3c188cc7f29593c44345
    Merge: 3dcd0158 ed171293
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Apr 23 10:34:03 2024 +0800

        Merge pull request #61 from tupini07/patch-1

        Fix typo in Qwen-VL that was causing "reference before assignment"

    commit ed171293d1e82075c5c6a847fc91ecbfd45cf89f
    Author: Andrea Tupini <tupini07@gmail.com>
    Date:   Mon Apr 22 14:56:41 2024 -0600

        refactor query construction for clarity

    commit cd874201c46f32a2903ddffae85f9db73e14adfd
    Author: Andrea Tupini <tupini07@gmail.com>
    Date:   Mon Apr 22 14:54:29 2024 -0600

        convert contexts to list if necessary and remove unnecessary construction of `questions`

    commit 85573674e90c8d505312ba18c5102e0051255078
    Author: Andrea Tupini <tupini07@gmail.com>
    Date:   Mon Apr 22 14:47:33 2024 -0600

        Fix typo in qwen_vl that was causing "reference before assignment"

    commit 3dcd01582b719555bcf8eb25d91cc5e42abd2c5f
    Merge: 95df9fee 743673a1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Apr 20 22:03:16 2024 +0800

        Merge pull request #60 from CaraJ7/main

        Add MathVerse

    commit 743673a1419b6e729e18c96f148745cc739d4c71
    Merge: c1a54721 95df9fee
    Author: CaraJ7 <1350074492@qq.com>
    Date:   Sat Apr 20 21:49:02 2024 +0800

        Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval

    commit c1a5472135c3b84061b64d997ab50dda0412ba4f
    Author: CaraJ7 <1350074492@qq.com>
    Date:   Sat Apr 20 21:45:34 2024 +0800

        Add MathVerse

    commit 373265f24e7a89cbd49ab724a2e388cc0930be78
    Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
    Date:   Fri Apr 12 17:21:39 2024 -0700

        Add files via upload

    commit d8530514a5ef9378d2adeaceb228b60ec25a6718
    Author: Gagan Bhatia <49101362+gagan3012@users.noreply.github.com>
    Date:   Fri Apr 12 17:19:49 2024 -0700

        Create README.md

    commit 22a4958e993463edff352ac033014f9a485706cc
    Author: Bo Li <bo.li01@bytedance.com>
    Date:   Thu Apr 4 17:12:43 2024 +0000

        [WIP] adding mmbench dev evaluation (#75)

        * WIP

        * Update GPT evaluation model name and sys prompt

        * 🛠️ Scale accuracy to percentage

        The accuracy value is now multiplied by 100 in the aggregation function so it is reported as a percentage. On the evaluation side, the `math` module is imported and progress logging is refactored to log every 100 evaluations instead of every 10, preventing potential logging overflow. Handling of NaN values is added so that 'default_value' is used when data is missing, avoiding errors in split, category, and l2-category assignments. Finally, categorical and l2-categorical accuracies are reported through a new `calculate_hit_rates` function (sketched below), improving code readability and maintenance.

        Issue refs: #1427, #1533

        * Update GPT evaluation model name and API configuration

        * Refactor MMBench_Evaluator class to handle missing columns

        * Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations

        * Refactor MMBench-CN and MMBench-EN evaluation functions

        * 🔄 Refactor result processing and logging logic

        - Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
        - Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
        - In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
        - Commented-out code and verbose logging during evaluation, which may have interfered with performance, has been removed for a more efficient and less intrusive logging experience.

        This cleanup reduces redundancy in the codebase and improves evaluation performance.

        Refs #2045

        ---------

        Co-authored-by: Bo Li <bo.li01@bytedance.com>
        (cherry picked from commit a19278c2ea6ddcbca64d3cc7f4efec7fe5775121)
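As an illustration of the result-processing changes described in this commit message, the sketch below shows the two ideas in isolation: options "A" through "E" are always attached to a result entry (defaulting to "nan" when absent), and per-category accuracies are reported as percentages by a `calculate_hit_rates`-style helper. Field names and structure here are assumptions; the real MMBench utilities differ in detail.

```python
import math
from collections import defaultdict

def build_result_entry(doc, prediction):
    # Dynamically attach every multiple-choice option, defaulting to "nan"
    # when the document does not provide it.
    entry = {"index": doc.get("index"), "question": doc.get("question"), "prediction": prediction}
    for option in "ABCDE":
        entry[option] = doc.get(option, "nan")
    return entry

def calculate_hit_rates(records):
    # records: [{"category": str, "score": 0/1 or NaN}, ...]
    per_category = defaultdict(list)
    for r in records:
        score = r.get("score")
        if score is None or (isinstance(score, float) and math.isnan(score)):
            score = 0  # fall back to a default value when data is missing
        per_category[r.get("category", "default_value")].append(score)
    total = sum(len(v) for v in per_category.values())
    overall = 100.0 * sum(sum(v) for v in per_category.values()) / total if total else 0.0
    return overall, {cat: 100.0 * sum(v) / len(v) for cat, v in per_category.items()}
```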

    commit 8d3526c0869f0ad7747ff6bb02441140792b461c
    Author: cocoshe <1228759711@qq.com>
    Date:   Thu Mar 28 13:38:36 2024 +0800

        fix doc

    * feat: Add LlavaOneVision model to available models

    chore: Update sqlitedict dependency to version 2.1.0

    * Revert "Squashed commit of the following:"

    This reverts commit 11b00999df3c43cb225482e030b791b2d454124c.

    * Refactor available models in lmms_eval

    Remove duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary in lmms_eval/models/__init__.py.

    * fix: Handle import errors in lmms_eval models/__init__.py

    The code changes in this commit fix the handling of import errors in the lmms_eval/models/__init__.py file. Previously, when an import error occurred, the code simply ignored it. This commit updates the code to log an error message using the logger module when an import error occurs.

    This commit also removes duplicate entries for "llava_hf", "llava_onevision", and "longva" in the AVAILABLE_MODELS dictionary.

    Recent user commits:
    - Refactor available models in lmms_eval
    - Revert "Squashed commit of the following:"
    - feat: Add LlavaOneVision model to available models
    - chore: Update sqlitedict dependency to version 2.1.0

    * fix: Handle import errors in lmms_eval models/__init__.py

    * chore: Remove unused imports in lmms_eval/models/__init__.py and lmms_eval/tasks/vcr_wiki/utils.py

    * Remove unused imports in lmms_eval/tasks/vcr_wiki/utils.py

    * chore: Update lmms_eval/tasks/vcr_wiki/utils.py

    This commit updates the `lmms_eval/tasks/vcr_wiki/utils.py` file. It removes unused imports and fixes the condition for loading Spacy models based on the `load_package` value in the config file. Additionally, it adds a debug log message when the Spacy models are not loaded due to `load_package` being set to False.

    Remove unused imports in `lmms_eval/tasks/vcr_wiki/utils.py`
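A minimal sketch of the `load_package` gating described above, assuming a loguru-style logger and English/Chinese pipelines; the actual config keys and model names used by vcr_wiki may differ.

```python
from loguru import logger

def load_spacy_models(task_config):
    """Load Spacy pipelines only when the task config asks for them."""
    if not task_config.get("load_package", True):  # assumed config key
        logger.debug("Spacy models not loaded because load_package is set to False")
        return None
    import spacy  # imported lazily so the dependency stays optional
    return {
        "en": spacy.load("en_core_web_sm"),
        "zh": spacy.load("zh_core_web_sm"),
    }
```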

    * feat: Add new subtasks to overall score calculation

    The code changes in this commit add new subtasks to the overall score calculation in the `overall_score` function. The subtasks "ScanQA", "BLINK", "MathVerse", "SciVerse", and "Mantis" are included in the `categories` dictionary. This ensures that the scores for these subtasks are calculated and included in the evaluation results.

    Remove unused imports and update subtask categories in `utils.py`
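A sketch of how the new subtasks could enter the overall score; only the subtask names come from the description above, the category labels and averaging are placeholders rather than the actual `overall_score` implementation.

```python
# Placeholder category labels; only the subtask names are taken from the description above.
categories = {
    "ScanQA": "spatial",
    "BLINK": "perception",
    "MathVerse": "math",
    "SciVerse": "science",
    "Mantis": "multi-image",
}

def overall_score(subtask_scores):
    """Average the scores of every subtask registered in `categories`."""
    included = [subtask_scores[name] for name in categories if name in subtask_scores]
    return sum(included) / len(included) if included else 0.0
```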

    * feat: Add new subtasks to overall score calculation

    * chore: Update lmms_eval/tasks/llava_interleave_bench/_default_template_interleave_yaml

    Update the image aspect ratio in the default template for the llava_interleave_bench task. Change the value of "image_aspect_ratio" from "original" to "pad". This ensures that the generated images have a padded aspect ratio.

    * if no response directly return 0

    * Squashed commit of the following:

    commit b2a009b6bbf8353172f5a1dd9c29ea1f67610c02
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Mon Jul 15 19:12:25 2024 -0700

        if no response directly return 0 (#142)

    commit 5fc5f2f5acf454fc99448b0d62eb52b4bffba0d5
    Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
    Date:   Tue Jul 16 10:12:11 2024 +0800

        Add Muirbench (#143)

        * handle gen kwargs in internvl2

        * Add muirbench

    * Add files via upload

    (cherry picked from commit 557083a156c3dd67ac79e22b4202e9b69b6b00f4)

    * update

    ---------

    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: Yan Shu <570533048@qq.com>

commit b2a009b6bbf8353172f5a1dd9c29ea1f67610c02
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Mon Jul 15 19:12:25 2024 -0700

    if no response directly return 0 (#142)

commit 5fc5f2f5acf454fc99448b0d62eb52b4bffba0d5
Author: Kaichen Zhang - NTU <kaichenzhang358@outlook.com>
Date:   Tue Jul 16 10:12:11 2024 +0800

    Add Muirbench (#143)

    * handle gen kwargs in internvl2

    * Add muirbench

commit 4f8db1d37b1f824432927e74d6d82e06bb5aaed1
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Fri Jul 12 17:26:50 2024 -0700

    Upload live_bench results (#140)

    * upload results

    * add a readme

    * chore: Update upload_results.py script to use shell syntax

    * Update upload_results.py

    * Update upload_results.py

commit 18f3812c4f9af2e49af6b50e8afe7f607b8a75d6
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Jul 10 18:13:43 2024 -0700

    Load tasks only one time (#139)

    * chore: Initialize tasks only once to avoid re-initialization

    * chore: Initialize tasks only once to avoid re-initialization

    * chore: Refactor task initialization to avoid re-initialization

    * chore: Update task initialization to fix include_path issue

    * chore: Update task initialization to fix include_path issue
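The "initialize only once" idea above can be sketched with a simple module-level guard; the real task loader and its include_path handling are more involved.

```python
_TASK_REGISTRY = {}
_TASKS_INITIALIZED = False

def initialize_tasks(include_path=None):
    """Populate the task registry a single time, then return the cached registry."""
    global _TASKS_INITIALIZED
    if _TASKS_INITIALIZED:
        return _TASK_REGISTRY
    # ... scan built-in task YAMLs here, plus any user-supplied include_path ...
    _TASKS_INITIALIZED = True
    return _TASK_REGISTRY
```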

* chore: Remove unnecessary line in muirbench.yaml

* chore: Remove unnecessary line in muirbench.yaml and update gitignore

* chore: Update lmms_eval to use correct variable name for world size

* Update mmvet

* chore: Update lmms_eval to use correct variable name for world size

* chore: Remove unused lmms_eval configuration file

* refactor: Update lmms_eval to handle both image and video tasks

This commit updates the `Llava_OneVision` class in `llava_onevision.py` to handle both image and video tasks. It introduces conditional logic to differentiate between the two types of tasks and process the input accordingly. Additionally, it sets the image aspect ratio based on the number of visual inputs and the configuration settings.

Closes #123
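A hedged sketch of the image-vs-video branching described above; the threshold, config keys, and return shape are assumptions, and the real `Llava_OneVision` class is considerably more involved.

```python
def prepare_visual_inputs(visuals, config):
    """Differentiate image tasks from video tasks and pick an aspect-ratio setting.

    Assumes `visuals` is a list of PIL images (image tasks) or decoded video
    frames (video tasks); both the heuristic and keys below are illustrative.
    """
    is_video_task = len(visuals) > config.get("frames_threshold", 8)  # assumed heuristic
    if is_video_task:
        image_aspect_ratio = "original"   # keep frames untouched for video inputs
    elif len(visuals) > 1:
        image_aspect_ratio = "pad"        # pad multi-image inputs
    else:
        image_aspect_ratio = config.get("image_aspect_ratio", "pad")
    return visuals, image_aspect_ratio
```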

* Fix llava onevision loglikelihood video bug

(cherry picked from commit f96e3e69fe86dcd9cb33d2bc18cc4ff2003de8be)

* refactor: Update mm_spatial_pool_mode to use bilinear interpolation

This commit updates the `mm_spatial_pool_mode` parameter in the `Llava_OneVision` class of `llava_onevision.py` to use bilinear interpolation instead of the previous average pooling mode. This change improves the spatial pooling process for the model.

Closes #456
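The difference between the two pooling modes can be illustrated with a small PyTorch sketch; this is not the actual LLaVA-OneVision implementation, just the generic average-pool versus bilinear-interpolation choice.

```python
import torch.nn.functional as F

def spatial_pool(features, mode="bilinear", stride=2):
    """Pool a (N, C, H, W) vision-feature map using the selected mm_spatial_pool_mode."""
    if mode == "average":
        return F.avg_pool2d(features, kernel_size=stride, stride=stride)
    if mode == "bilinear":
        n, c, h, w = features.shape
        return F.interpolate(features, size=(h // stride, w // stride),
                             mode="bilinear", align_corners=False)
    raise ValueError(f"Unsupported mm_spatial_pool_mode: {mode}")
```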

* chore: Update pyproject.toml with protobuf dependency version 3.20

* Squashed commit of the following:

commit e106f49ceeb295fd4c89a0877073bc01b4b77c5f
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Thu Jul 25 08:14:03 2024 +0800

    livebench_july

commit a16295653fdda20d5e8c41c549d731ec422013e3
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Mon Jul 22 15:09:58 2024 +0800

    websites

commit 2cdc06ffe6ba53a4c707c1acf9fc5f2e7886b2b8
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sun Jul 21 15:34:39 2024 +0800

    everything use gpt-4o

commit e67538d65526c58903d9e62d1914ebd39924ab67
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sun Jul 21 14:29:55 2024 +0800

    chore: Update dataset capture settings in create_dataset.py

commit 0a3bb33d37cda05bb7bfba4ecf873c2860092a03
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sun Jul 21 01:58:14 2024 +0800

    gpt-4-turbo => gpt-4o

commit 837f8b0400f04f4367f8f8f954afd64666d62fc6
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 16:48:04 2024 +0800

    chore: Update dataset name and version for live_bench task

commit fa58e730978b5536005c8bd0291abbeddd761205
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 15:05:13 2024 +0800

    generate data

commit faa96227a7af7bd6546578b2db68dce2acbc2c0c
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 13:15:18 2024 +0800

    fix

commit 60ea7ddb4fcd9f08013cd0d5b9dd8090f7e6b83e
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 13:12:31 2024 +0800

    fix bugs

commit 827d69d0bf967f5d69bfbee9848b4d568ca853b1
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 08:39:41 2024 +0800

    use claude to generate

commit b7e2619d1a51144cd434861ac151187aed82c8c4
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 07:36:59 2024 +0800

    extract information

commit f87d55d47cb0d6653765e9e3f988f4bc186f7d4c
Author: Fanyi Pu <FPU001@e.ntu.edu.sg>
Date:   Sat Jul 20 07:24:07 2024 +0800

    claude auto detect json mode

commit dfdba507b5fbe985b0030ffec575f9f2638bc1ed
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Jul 16 11:13:52 2024 +0800

    merge ov evals (#144)

    * chore: Update gpt_eval_model_name to "gpt-3.5-turbo" in mathvista.yaml

    * Squashed commit of the following:

    commit 994c9f97a2f8db3e9b7d7933d1e1680acde5b70b
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 17:21:23 2024 +0800

        Add files via upload

    * Squashed commit of the following:

    commit e31cd7883d4555c7530795c7f102b8d78cbd372f
    Author: Bo Li <drluodian@gmail.com>
    Date:   Wed Jul 10 12:08:08 2024 +1000

        chore: Update lmms_eval/models/vila.py and lmms_eval/tasks/__init__.py

    commit 1d8c980d1089f9d7702c3b92d5c85039f2809c6d
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jul 9 02:08:52 2024 +0000

        Rename xcomposer 4KHD

    commit 6da76f36ecf5f9aa73057e767a4fcb60c99ff896
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:55:56 2024 +1000

        Upgrade lmms-eval to version 0.2.1

    commit cd1858523fcd8630082cbefba8710e0de3ee8805
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:52:23 2024 +1000

        Upgrade lmms-eval to support more models and evaluation tasks

    commit 672d7e5bb49dcb34e1b2fdeb09f3f4588dc583a6
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:43:41 2024 +1000

        feat: Add tie_weights parameter to Llava model initialization

    commit 2037a86261b55fa42b8ba3a04eab192b3e69d6ea
    Merge: e6844db1 a5c18692
    Author: Bo Li <drluodian@gmail.com>
    Date:   Tue Jul 9 11:37:12 2024 +1000

        Fix gen kwargs image aspect ratio in internvl2

    commit a5c186925de989b616f58a35ece36065a32b4594
    Merge: 2ebec77f 557083a1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 9 09:15:56 2024 +0800

        Merge pull request #137 from shuyansy/main

        add MLVU task

    commit 557083a156c3dd67ac79e22b4202e9b69b6b00f4
    Author: Yan Shu <570533048@qq.com>
    Date:   Mon Jul 8 16:56:50 2024 +0800

        Add files via upload

    commit 2ebec77f5606d79e9a7b995970e32792050606a1
    Merge: 211bfede b23d349e
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 8 11:53:06 2024 +0800

        Merge pull request #136 from Dousia/main

        Add detailcaps

    commit b23d349e46d60dc149ffaa54d6e019f4996ed92d
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:24:19 2024 +0800

        Add install capture_metric in env

    commit c6e211d5f9dbb7572d3a141b6504cb1ca2007c33
    Author: ByteDance <bytedance@MacBook-Pro.local>
    Date:   Sun Jul 7 23:04:13 2024 +0800

        Add detailcaps

    commit 211bfedebad243ef82a8b0be36c3b5a9b9cb2f72
    Merge: 7c208b76 79514eee
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Jul 2 23:05:12 2024 +0800

        Merge pull request #133 from EvolvingLMMs-Lab/dev/wild_vision

        Add wild vision bench

    commit 79514eeebcfd6f655be2a10c776037d12a7b7214
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 15:10:02 2024 +0000

        Fixing handling None filtered score

    commit 725fac2781446958b905e1e6c6eb3c0a8e582e49
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:25:42 2024 +0000

        Fixing dataset name

    commit 8d963e132ac03fc0d835d480cfcfcabe72af143c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 08:24:51 2024 +0000

        Fixing scoring logic

    commit e2990d0a69e876721256fdf946c68ba7ae0cbdc1
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:57 2024 +0000

        Hardcode to keep image for wild vision

    commit ed381736730d8fb785b4ee919fdb751734ecef25
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Mon Jul 1 06:06:38 2024 +0000

        Add wild vision 0617

    commit 7c208b76640c986cfe94233dce735c3ca4ad4319
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:53:31 2024 +0800

        Update README.md

    commit 39d40dea47bc59ff04e8b0cbc445345098debc9a
    Merge: e19b43a3 ba7081c0
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:47:09 2024 +0800

        Merge pull request #129 from Dannoopsy/mmbench_ru

        add task MMBench-ru

    commit e19b43a3a1e7212e623061b164b0419cc0dda689
    Merge: 11fd7e3f a0de8970
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:58 2024 +0800

        Merge pull request #128 from Dannoopsy/gqa-ru

        add task gqa-ru

    commit 11fd7e3fc05908aeb01e4a6161a7b55cd38b3122
    Merge: 383e7fea a7522592
    Author: Li Bo <drluodian@gmail.com>
    Date:   Mon Jul 1 11:46:16 2024 +0800

        Merge pull request #130 from lscpku/vitatecs

        Add task VITATECS

    commit a75225926e5954f85466d257f99acf0163fde596
    Author: lscpku <lisc99@pku.edu.cn>
    Date:   Fri Jun 28 20:37:06 2024 +0800

        create new task vitatecs

    commit ba7081c0abac840002d320e30733e891298dfa11
    Author: Dannoopsy <63581325+Dannoopsy@users.noreply.github.com>
    Date:   Fri Jun 28 12:21:05 2024 +0300

        change prompt to ru

    commit 27ea9c0055a8abf3a8198829b8617018479918e2
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Thu Jun 27 17:17:29 2024 +0000

        add mmbench_ru_dev

    commit 383e7fead3138aedf62e9c0ec48303835ef26e2a
    Merge: 06fa000f ed2e7f79
    Author: Li Bo <drluodian@gmail.com>
    Date:   Fri Jun 28 00:14:10 2024 +0800

        Merge pull request #126 from lorenzomammana/feature/external-package-integration

        External package integration using plugins

    commit ed2e7f792151d21bce8f1c498270b9391e1d5c85
    Merge: 03947e14 06fa000f
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Thu Jun 27 15:38:10 2024 +0000

        Merge branch 'main' into feature/external-package-integration

    commit a0de89708d5e6f259bb17f0eaace3c5b901b275c
    Author: Dannoopsy <belopolskikh.dd@phystech.edu>
    Date:   Tue Jun 25 11:11:37 2024 +0000

        new task gqa-ru

    commit 06fa000f60d3e4d160fac8ceb9959ae92a98f752
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Tue Jun 25 06:41:13 2024 +0000

        Fix vid mme post prompt issue

    commit b388d79e0df6f60068196cb7047453ebd22d6ef1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 22:31:16 2024 +0800

        Update activitynetqa_generation.yaml

    commit 8f9d620fcd9d0a0742ee6bcf51ea63bd6b088a36
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:25 2024 +0800

        Update pyproject.toml

    commit 6341b7c15ce9fb28eb06b067ddb299d6cf2e16c3
    Merge: fce85f1b 903b042b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jun 23 14:02:02 2024 +0800

        Merge pull request #125 from EvolvingLMMs-Lab/dev/interleave

        [Model] aligned llava-interleave model results on video tasks

    commit 903b042be016016d4ebeecb07701f3076a2d323c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 12:07:13 2024 +0000

        Remove unnecessary lines for video llava

    commit d78ec86407b729a964906a8c2e50704b4bc74d06
    Merge: ebe7217a fce85f1b
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sat Jun 22 13:57:31 2024 +0800

        Merge branch 'main' into dev/interleave

    commit ebe7217a486c1e754e42c2cbdb834e09fbbcc9b0
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Sat Jun 22 02:57:08 2024 +0000

        Delete unnecessary lines

    commit 120c474b056f9177c74e1fd9691d59e2f234b785
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:38:41 2024 +0000

        Revise model registry for llava_hf and longva

    commit 7d6201f921088afd3f52a35076e3c6fcc9aa518c
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:38:24 2024 +0000

        Add longva

    commit 12f480699c71a12a24d4349d9b0681933201a3a6
    Author: kcz358 <kaichenzhang358@outlook.com>
    Date:   Fri Jun 21 08:35:39 2024 +0000

        Remove unnecessary lines since use batched visuals now in llava

    commit 12cea76f1f0f14b1fd1007c9d39a9b0557368637
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 18:15:32 2024 +0000

        chore: Add loguru for logging in lmms_eval package

    commit 03947e14a46fd25b412931f7c9c25f4a2971d0b4
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Wed Jun 5 13:40:41 2024 +0000

        feat: Allow including external tasks from plugins

    commit b80a91f73e15ddd0b0ce1322d7d121fa14030eed
    Author: Lorenzo Mammana <mammanalorenzo@outlook.it>
    Date:   Wed Jun 5 13:04:55 2024 +0000

        feat: Allow loading model configurations from other packages

    commit 8ef24740dd48a11c97eb627f2fff4aca107fef0d
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 12:11:03 2024 +0000

        chore: Remove unused models from lmms_eval package

    commit af38885fc2e066f5ea44388f33e07176f836fe28
    Author: Bo Li <drluodian@gmail.com>
    Date:   Thu Jun 20 12:07:09 2024 +0000

        chore: Handle ImportError when importing models

        Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.

    commit fce85f1b03ff7043b29dee787c5d17a08dd2687a
    Merge: dbe63293 d94f83cb
    Author: Li Bo <drluodian@gmail.com>
    Date:   Thu Jun 20 20:02:12 2024 +0800

        Merge pull request #120 from EvolvingLMMs-Lab/pufanyi/hf_dataset_docs

        Add docs for datasets upload to HF

    commit dbe63293245a5141fdfd80bda7657c304f6bd32f
    Author: choiszt <ls2001927@sohu.com>
    Date:   Thu Jun 20 15:14:21 2024 +0800

        u…
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* refactor vizwizvqa task

* Delete vqav2_test and vqav2_val YAML files

* Refactor vqav2_process_results functions

* Add a pack for vqav2

* refactor okvqa

* roll back vizwiz_vqa

* Fix exact_match calculation in ok_vqa_process_results

* Update OKVQA dataset name in readme

* add model_specific_prompt_kwargs

* add model_specific_prompt_kwargs to vizwiz_vqa

* add model_specific_prompt_kwargs for vqav2

* lint

* fix a small bug for eval_logger

* Refactor make_table function to display points as "  -  " if value is None

* Merge commit '63fc8eee4dddfbe741e5a862e5ff30d19c34238e'

* Refactor ok_vqa_aggreate_submissions function

* Merge commit 'd16bbce134d453c624834e090af1e0f869fdde15'

* Refactor VQA submission file saving

* Update file utils

* Merge commit '7332704263a45ab6fa69aad0c4303cd9cbc26813'

* Refactor file path handling and submission generation

* OKVQA path

* vizwizvqa file

* pack cmmmu

* fix a small metric bug for cmmmu

* Add higher_is_better flag to submission metric

* Add CMMMU dataset to README.md

* Add logging and refactor submission file generation in docvqa utils.py

* pack docvqa

* add traceback to print detailed error

* Refactor docvqa_test_aggregate_results to accept additional arguments

* Add metric check in evaluator.py and update test.yaml and val.yaml

* add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

* merge textvqa

* textvqa

* Modify submission file generation for COCO test results

* Update test result storage path

* update coco cap file name

* Update COCO 2017 Caption dataset name

* ferret

* Add Ferret dataset

* Refactor hb_doc_to_text function to include model-specific prompts

* Add IconQA and its subtasks

* Refactor image list creation in doc_to_visual function

* Add process_results function to default template

* Update process_results function in iconqa utils.py

* refactor flickr30k

* change aggregation function

* Fix formatting issues and update logging message

* Fix llava can not handle only text question (no visuals)

* Fix qwen can not handle no image question (no visuals)

* Add fuyu prepare accelerator scripts

* refactor mme

* naming consistency

* aggregation_submissions consistency

* flickr30k naming consistency

* remove submissions for mme

* remove unused submission function

* Refactor infovqa_test.yaml and infovqa_val.yaml

* Refactor code for improved readability and maintainability

* stvqa

* rename sqa

* Update lmms_eval textcaps files and utils.py

* Update default prompt for text captions

* Refactor textcaps_aggregation_result function

* Add generate_submission_file function and update mathvista_aggregate_results signature

* Update nocaps_test.yaml and nocaps_val.yaml

* refactor internal_eval

* Add internal evaluation datasets

* pack multidocvqa

* mmvet

* Fix gpt eval timeout issue for hallubench; restore loading from gpt to avoid re-evaluating

* Refactor llava wild

* Refactor llava-bench-coco

* Add JSON file generation for gpt evaluation details

* mmmu

* Remove MMBench English and Chinese tasks

* Remove unnecessary return statement in mmbench_aggregate_test_results function

* Fix distributed process group initialization

* Update dataset paths and group names in mmbench test configs

* Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

* Add torch module import

* lint

* Remove IconQA dataset from README.md

* Add Multi-DocVQA and its submodules

* Add new datasets and update task names

* Refactor flickr_aggregation_result function to accept additional arguments

* Add timeout kwargs in Accelerator constructor

* Add encoding to be utf-8 for cmmmu

* Fix llava try and catch, remove torch.distributed.init in main

* Ds prepare script for llava

---------

Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* refactor vizwizvqa task

* Delete vqav2_test and vqav2_val YAML files

* Refactor vqav2_process_results functions

* Add a pack for vqav2

* refactor okvqa

* roll back vizwiz_vqa

* Fix exact_match calculation in ok_vqa_process_results

* Update OKVQA dataset name in readme

* add model_specific_prompt_kwargs

* add model_specific_prompt_kwargs to vizwiz_vqa

* add model_specific_prompt_kwargs for vqav2

* lint

* fix a small bug for eval_logger

* Refactor make_table function to display points as "  -  " if value is None

* Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

* Refactor ok_vqa_aggreate_submissions function

* Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

* Refactor VQA submission file saving

* Update file utils

* Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

* Refactor file path handling and submission generation

* OKVQA path

* vizwizvqa file

* pack cmmmu

* fix a small metric bug for cmmmu

* Add higher_is_better flag to submission metric

* Add CMMMU dataset to README.md

* Add logging and refactor submission file generation in docvqa utils.py

* pack docvqa

* add traceback to print detailed error

* Refactor docvqa_test_aggregate_results to accept additional arguments

* Add metric check in evaluator.py and update test.yaml and val.yaml

* add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

* merge textvqa

* textvqa

* Modify submission file generation for COCO test results

* Update test result storage path

* update coco cap file name

* Update COCO 2017 Caption dataset name

* ferret

* Add Ferret dataset

* Refactor hb_doc_to_text function to include model-specific prompts

* Add IconQA and its subtasks

* Refactor image list creation in doc_to_visual function

* Add process_results function to default template

* Update process_results function in iconqa utils.py

* refactor flickr30k

* change aggregation function

* Fix formatting issues and update logging message

* Fix llava can not handle only text question (no visuals)

* Fix qwen can not handle no image question (no visuals)

* Add fuyu prepare accelerator scripts

* refactor mme

* naming consistency

* aggregation_submissions consistency

* flickr30k naming consistency

* remove submissions for mme

* remove unused submission function

* Refactor infovqa_test.yaml and infovqa_val.yaml

* Refactor code for improved readability and maintainability

* stvqa

* remane sqa

* Update lmms_eval textcaps files and utils.py

* Update default prompt for text captions

* Refactor textcaps_aggregation_result function

* Add generate_submission_file function and update mathvista_aggregate_results signature

* Update nocaps_test.yaml and nocaps_val.yaml

* refractor internal_eval

* Add internal evaluation datasets

* pack multidocvqa

* mmvet

* Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

* Refractor llava wild

* Refractor llava-bench-coco

* Add JSON file generation for gpt evaluation details

* mmmu

* Remove MMBench English and Chinese tasks

* Remove unnecessary return statement in mmbench_aggregate_test_results function

* Fix distributed process group initialization

* Update dataset paths and group names in mmbench test configs

* Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

* Add torch module import

* lint

* Remove IconQA dataset from README.md

* Add Multi-DocVQA and its submodules

* Add new datasets and update task names

* Refactor flickr_aggregation_result function to accept additional arguments

* Add timeout kwargs in Accelerator constructor

* Add encoding to be utf-8 for cmmmu

* Fix llava try and catch, remove torch.distributed.init in main

* Ds prepare script for llava

---------

Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* refactor vizwizvqa task

* Delete vqav2_test and vqav2_val YAML files

* Refactor vqav2_process_results functions

* Add a pack for vqav2

* refactor okvqa

* roll back vizwiz_vqa

* Fix exact_match calculation in ok_vqa_process_results

* Update OKVQA dataset name in readme

* add model_specific_prompt_kwargs

* add model_specific_prompt_kwargs to vizwiz_vqa

* add model_specific_prompt_kwargs for vqav2

* lint

* fix a small bug for eval_logger

* Refactor make_table function to display points as "  -  " if value is None

* Merge commit 'b636596c46dce543cdfacc0809c5b14edafcf1fd'

* Refactor ok_vqa_aggreate_submissions function

* Merge commit '5624cd5b92ff6b1bc1d431a615d938fd623a03c4'

* Refactor VQA submission file saving

* Update file utils

* Merge commit '034d73b022739333da5e60f432330b8ea832ef9b'

* Refactor file path handling and submission generation

* OKVQA path

* vizwizvqa file

* pack cmmmu

* fix a small metric bug for cmmmu

* Add higher_is_better flag to submission metric

* Add CMMMU dataset to README.md

* Add logging and refactor submission file generation in docvqa utils.py

* pack docvqa

* add traceback to print detailed error

* Refactor docvqa_test_aggregate_results to accept additional arguments

* Add metric check in evaluator.py and update test.yaml and val.yaml

* add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

* merge textvqa

* textvqa

* Modify submission file generation for COCO test results

* Update test result storage path

* update coco cap file name

* Update COCO 2017 Caption dataset name

* ferret

* Add Ferret dataset

* Refactor hb_doc_to_text function to include model-specific prompts

* Add IconQA and its subtasks

* Refactor image list creation in doc_to_visual function

* Add process_results function to default template

* Update process_results function in iconqa utils.py

* refactor flickr30k

* change aggregation function

* Fix formatting issues and update logging message

* Fix llava can not handle only text question (no visuals)

* Fix qwen can not handle no image question (no visuals)

* Add fuyu prepare accelerator scripts

* refactor mme

* naming consistency

* aggregation_submissions consistency

* flickr30k naming consistency

* remove submissions for mme

* remove unused submission function

* Refactor infovqa_test.yaml and infovqa_val.yaml

* Refactor code for improved readability and maintainability

* stvqa

* remane sqa

* Update lmms_eval textcaps files and utils.py

* Update default prompt for text captions

* Refactor textcaps_aggregation_result function

* Add generate_submission_file function and update mathvista_aggregate_results signature

* Update nocaps_test.yaml and nocaps_val.yaml

* refractor internal_eval

* Add internal evaluation datasets

* pack multidocvqa

* mmvet

* Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

* Refractor llava wild

* Refractor llava-bench-coco

* Add JSON file generation for gpt evaluation details

* mmmu

* Remove MMBench English and Chinese tasks

* Remove unnecessary return statement in mmbench_aggregate_test_results function

* Fix distributed process group initialization

* Update dataset paths and group names in mmbench test configs

* Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

* Add torch module import

* lint

* Remove IconQA dataset from README.md

* Add Multi-DocVQA and its submodules

* Add new datasets and update task names

* Refactor flickr_aggregation_result function to accept additional arguments

* Add timeout kwargs in Accelerator constructor

* Add encoding to be utf-8 for cmmmu

* Fix llava try and catch, remove torch.distributed.init in main

* Ds prepare script for llava

---------

Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* refactor vizwizvqa task

* Delete vqav2_test and vqav2_val YAML files

* Refactor vqav2_process_results functions

* Add a pack for vqav2

* refactor okvqa

* roll back vizwiz_vqa

* Fix exact_match calculation in ok_vqa_process_results

* Update OKVQA dataset name in readme

* add model_specific_prompt_kwargs

* add model_specific_prompt_kwargs to vizwiz_vqa

* add model_specific_prompt_kwargs for vqav2

* lint

* fix a small bug for eval_logger

* Refactor make_table function to display points as "  -  " if value is None

* Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

* Refactor ok_vqa_aggreate_submissions function

* Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

* Refactor VQA submission file saving

* Update file utils

* Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

* Refactor file path handling and submission generation

* OKVQA path

* vizwizvqa file

* pack cmmmu

* fix a small metric bug for cmmmu

* Add higher_is_better flag to submission metric

* Add CMMMU dataset to README.md

* Add logging and refactor submission file generation in docvqa utils.py

* pack docvqa

* add traceback to print detailed error

* Refactor docvqa_test_aggregate_results to accept additional arguments

* Add metric check in evaluator.py and update test.yaml and val.yaml

* add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

* merge textvqa

* textvqa

* Modify submission file generation for COCO test results

* Update test result storage path

* update coco cap file name

* Update COCO 2017 Caption dataset name

* ferret

* Add Ferret dataset

* Refactor hb_doc_to_text function to include model-specific prompts

* Add IconQA and its subtasks

* Refactor image list creation in doc_to_visual function

* Add process_results function to default template

* Update process_results function in iconqa utils.py

* refactor flickr30k

* change aggregation function

* Fix formatting issues and update logging message

* Fix llava can not handle only text question (no visuals)

* Fix qwen can not handle no image question (no visuals)

* Add fuyu prepare accelerator scripts

* refactor mme

* naming consistency

* aggregation_submissions consistency

* flickr30k naming consistency

* remove submissions for mme

* remove unused submission function

* Refactor infovqa_test.yaml and infovqa_val.yaml

* Refactor code for improved readability and maintainability

* stvqa

* remane sqa

* Update lmms_eval textcaps files and utils.py

* Update default prompt for text captions

* Refactor textcaps_aggregation_result function

* Add generate_submission_file function and update mathvista_aggregate_results signature

* Update nocaps_test.yaml and nocaps_val.yaml

* refractor internal_eval

* Add internal evaluation datasets

* pack multidocvqa

* mmvet

* Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

* Refractor llava wild

* Refractor llava-bench-coco

* Add JSON file generation for gpt evaluation details

* mmmu

* Remove MMBench English and Chinese tasks

* Remove unnecessary return statement in mmbench_aggregate_test_results function

* Fix distributed process group initialization

* Update dataset paths and group names in mmbench test configs

* Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

* Add torch module import

* lint

* Remove IconQA dataset from README.md

* Add Multi-DocVQA and its submodules

* Add new datasets and update task names

* Refactor flickr_aggregation_result function to accept additional arguments

* Add timeout kwargs in Accelerator constructor

* Add encoding to be utf-8 for cmmmu

* Fix llava try and catch, remove torch.distributed.init in main

* Ds prepare script for llava

---------

Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* refactor vizwizvqa task

* Delete vqav2_test and vqav2_val YAML files

* Refactor vqav2_process_results functions

* Add a pack for vqav2

* refactor okvqa

* roll back vizwiz_vqa

* Fix exact_match calculation in ok_vqa_process_results

* Update OKVQA dataset name in readme

* add model_specific_prompt_kwargs

* add model_specific_prompt_kwargs to vizwiz_vqa

* add model_specific_prompt_kwargs for vqav2

* lint

* fix a small bug for eval_logger

* Refactor make_table function to display points as "  -  " if value is None

* Merge commit '5e73e8b8a2408bd8193361788669ca80db19cb04'

* Refactor ok_vqa_aggreate_submissions function

* Merge commit '40099e8b8145bde513b9b7cef8461d8f13d1eafe'

* Refactor VQA submission file saving

* Update file utils

* Merge commit 'a56fe11c00ad4a8b8967be88b93baef6649528c5'

* Refactor file path handling and submission generation

* OKVQA path

* vizwizvqa file

* pack cmmmu

* fix a small metric bug for cmmmu

* Add higher_is_better flag to submission metric

* Add CMMMU dataset to README.md

* Add logging and refactor submission file generation in docvqa utils.py

* pack docvqa

* add traceback to print detailed error

* Refactor docvqa_test_aggregate_results to accept additional arguments

* Add metric check in evaluator.py and update test.yaml and val.yaml

* add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

* merge textvqa

* textvqa

* Modify submission file generation for COCO test results

* Update test result storage path

* update coco cap file name

* Update COCO 2017 Caption dataset name

* ferret

* Add Ferret dataset

* Refactor hb_doc_to_text function to include model-specific prompts

* Add IconQA and its subtasks

* Refactor image list creation in doc_to_visual function

* Add process_results function to default template

* Update process_results function in iconqa utils.py

* refactor flickr30k

* change aggregation function

* Fix formatting issues and update logging message

* Fix llava can not handle only text question (no visuals)

* Fix qwen can not handle no image question (no visuals)

* Add fuyu prepare accelerator scripts

* refactor mme

* naming consistency

* aggregation_submissions consistency

* flickr30k naming consistency

* remove submissions for mme

* remove unused submission function

* Refactor infovqa_test.yaml and infovqa_val.yaml

* Refactor code for improved readability and maintainability

* stvqa

* remane sqa

* Update lmms_eval textcaps files and utils.py

* Update default prompt for text captions

* Refactor textcaps_aggregation_result function

* Add generate_submission_file function and update mathvista_aggregate_results signature

* Update nocaps_test.yaml and nocaps_val.yaml

* refractor internal_eval

* Add internal evaluation datasets

* pack multidocvqa

* mmvet

* Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

* Refractor llava wild

* Refractor llava-bench-coco

* Add JSON file generation for gpt evaluation details

* mmmu

* Remove MMBench English and Chinese tasks

* Remove unnecessary return statement in mmbench_aggregate_test_results function

* Fix distributed process group initialization

* Update dataset paths and group names in mmbench test configs

* Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

* Add torch module import

* lint

* Remove IconQA dataset from README.md

* Add Multi-DocVQA and its submodules

* Add new datasets and update task names

* Refactor flickr_aggregation_result function to accept additional arguments

* Add timeout kwargs in Accelerator constructor

* Add encoding to be utf-8 for cmmmu

* Fix llava try and catch, remove torch.distributed.init in main

* Ds prepare script for llava

---------

Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* refactor vizwizvqa task

* Delete vqav2_test and vqav2_val YAML files

* Refactor vqav2_process_results functions

* Add a pack for vqav2

* refactor okvqa

* roll back vizwiz_vqa

* Fix exact_match calculation in ok_vqa_process_results

* Update OKVQA dataset name in readme

* add model_specific_prompt_kwargs

* add model_specific_prompt_kwargs to vizwiz_vqa

* add model_specific_prompt_kwargs for vqav2

* lint

* fix a small bug for eval_logger

* Refactor make_table function to display points as "  -  " if value is None

* Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

* Refactor ok_vqa_aggreate_submissions function

* Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

* Refactor VQA submission file saving

* Update file utils

* Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

* Refactor file path handling and submission generation

* OKVQA path

* vizwizvqa file

* pack cmmmu

* fix a small metric bug for cmmmu

* Add higher_is_better flag to submission metric

* Add CMMMU dataset to README.md

* Add logging and refactor submission file generation in docvqa utils.py

* pack docvqa

* add traceback to print detailed error

* Refactor docvqa_test_aggregate_results to accept additional arguments

* Add metric check in evaluator.py and update test.yaml and val.yaml

* add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

* merge textvqa

* textvqa

* Modify submission file generation for COCO test results

* Update test result storage path

* update coco cap file name

* Update COCO 2017 Caption dataset name

* ferret

* Add Ferret dataset

* Refactor hb_doc_to_text function to include model-specific prompts

* Add IconQA and its subtasks

* Refactor image list creation in doc_to_visual function

* Add process_results function to default template

* Update process_results function in iconqa utils.py

* refactor flickr30k

* change aggregation function

* Fix formatting issues and update logging message

* Fix llava can not handle only text question (no visuals)

* Fix qwen can not handle no image question (no visuals)

* Add fuyu prepare accelerator scripts

* refactor mme

* naming consistency

* aggregation_submissions consistency

* flickr30k naming consistency

* remove submissions for mme

* remove unused submission function

* Refactor infovqa_test.yaml and infovqa_val.yaml

* Refactor code for improved readability and maintainability

* stvqa

* remane sqa

* Update lmms_eval textcaps files and utils.py

* Update default prompt for text captions

* Refactor textcaps_aggregation_result function

* Add generate_submission_file function and update mathvista_aggregate_results signature

* Update nocaps_test.yaml and nocaps_val.yaml

* refractor internal_eval

* Add internal evaluation datasets

* pack multidocvqa

* mmvet

* Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

* Refractor llava wild

* Refractor llava-bench-coco

* Add JSON file generation for gpt evaluation details

* mmmu

* Remove MMBench English and Chinese tasks

* Remove unnecessary return statement in mmbench_aggregate_test_results function

* Fix distributed process group initialization

* Update dataset paths and group names in mmbench test configs

* Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

* Add torch module import

* lint

* Remove IconQA dataset from README.md

* Add Multi-DocVQA and its submodules

* Add new datasets and update task names

* Refactor flickr_aggregation_result function to accept additional arguments

* Add timeout kwargs in Accelerator constructor

* Add encoding to be utf-8 for cmmmu

* Fix llava try and catch, remove torch.distributed.init in main

* Ds prepare script for llava

---------

Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
…d context issue (EvolvingLMMs-Lab#59)

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit b3f1eff
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0f26c8a
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggregate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * rename sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refactor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re-evaluating

    * Refactor llava wild

    * Refactor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit fefc964
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Fix small bugs in list_with_num

* Revise list_with_num model args

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava failing to handle text-only questions (no visuals)

    * Fix qwen failing to handle questions without images (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * rename sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refactor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re-evaluating

    * Refactor llava wild

    * Refactor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0e0c698
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit 584db7fcc0140dd4a6d6481529ae90570b2912c4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 5e52a8df3785eb2d1b392eb164b66e92c9dadb02
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit a3cae8e9f3570121d51885c71f7081da36c5d13d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 0b3cad596fd58e6414ea015e79bef1eea6eb7f7a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit f436cb65bd716d93044516ece2133ab5b8d87137
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 3d47b59f92cef22cfe38e00b407ce38a61d538b2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit ffb9eb26dae25cda1e0d3e302852862102b47054
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 1700786b572cbedcb6969ae97028225d388987bb
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit 786f2b53d57265b9900b0718d27538221b5f81b4
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 888c1c128319bd04528727a309d0d92aaee9e752
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 8c74caa2f77940c781501b45571d7c6362c9a6c8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 4ab5cc32e3a460ad112dcd3031cea55b6bc0f691
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit eae08b536908875eeb600538e853caaa14c655ae
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 4240785c1bf3a7fd15f36013803c004542a17f2e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c15bf75d2f76e215b4d5de43c1d17b5a41d79753
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 93534dc4e98b78b9da01099079187d8960705fb8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit 05166a14c45063bf108282c3202d32feb2fe0afa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 172a002845728f263a9221206aeab62bdc1070dc
Merge: 2475639 2152f18
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 2475639fcf9164a7965b080c31dc50bc856fa053
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 2152f18
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 5902608191d5a8a059c2a267afc0100f47140fae
Merge: 83358a4 fd7773d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 83358a42354d8ec57d3d887e2262f82e7dd4c532
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit fd7773d
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit ce51924783fa5c50f99815a33988476ee1220bac
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit a288035b48620b827a82c1c45412fe2bb3c18715
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
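
Two items in the commit body above, "Fix distributed process group initialization" and "Add timeout kwargs in Accelerator constructor", point at multi-GPU runs aborting because one rank sits in a long GPT-judged aggregation while the others wait. With Hugging Face `accelerate`, the usual way to give the process group a longer timeout is an `InitProcessGroupKwargs` handler; the sketch below shows that pattern under the assumption that this is roughly what the commit does (the six-hour value is arbitrary).

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the distributed process-group timeout so ranks that finish early do not
# abort while another rank is still waiting on a slow, API-judged metric.
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=6))
accelerator = Accelerator(kwargs_handlers=[init_kwargs])

if accelerator.is_main_process:
    print(f"running on {accelerator.num_processes} process(es)")
accelerator.wait_for_everyone()
```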

    commit 0182d5d
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>
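
The "Change remove image to check by type instead of check by names" and "More robust check by type" items above describe dropping raw image objects from logged samples based on their Python type rather than on hard-coded field names. A small, hedged sketch of that idea follows; the sample fields are made up for illustration, and the real code may also need to handle lists of images.

```python
from PIL import Image


def strip_images(doc: dict) -> dict:
    """Return a copy of a sample with PIL image values removed.

    Filtering on isinstance(value, Image.Image) is more robust than matching
    key names like "image" or "img", since tasks name visual fields freely.
    """
    return {k: v for k, v in doc.items() if not isinstance(v, Image.Image)}


# Hypothetical sample: only the non-image fields survive.
sample = {"question": "What is shown?", "image": Image.new("RGB", (4, 4))}
print(strip_images(sample))  # {'question': 'What is shown?'}
```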

commit bf67bcc02cb57e63952e4429515269458084ea5f
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit c3e54461dd77f62aa50bcee8fbbebc14e4470644
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping
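
Several commits in this log ("[Wandb Logger] add models, and args to wandb tables", "[Fix] wandb group logging missing columns", and the fix above) deal with grouping per-task results into Weights & Biases tables. As a rough illustration of the pattern rather than the repository's actual logging utility, results can be pushed as a `wandb.Table`; every name and number below is illustrative.

```python
import wandb

# Illustrative results; a real run would collect these from the evaluator.
results = [
    {"task": "mme", "metric": "perception_score", "value": 1450.0},
    {"task": "textvqa_val", "metric": "exact_match", "value": 0.561},
]

run = wandb.init(project="lmms-eval", name="llava-v1.5-7b")
table = wandb.Table(columns=["task", "metric", "value"])
for row in results:
    table.add_data(row["task"], row["metric"], row["value"])
run.log({"results": table})
run.finish()
```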

commit 09eecf5
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 90fbf3d
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 0fa3bce
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


    commit 0182d5d
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>
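
Items such as "Fix hallusionbench gpt json saving path", "Rename hallubench gpt output path", and the earlier "Add encoding to be utf-8 for cmmmu" all revolve around writing per-task result and submission JSON files to a predictable location with an explicit encoding. A minimal sketch of that pattern follows; the helper name and directory layout are assumptions, not the repository's exact `generate_submission_file`.

```python
import json
import os


def save_submission(records: list[dict], out_dir: str, file_name: str) -> str:
    """Write evaluation records as UTF-8 JSON and return the saved path.

    ensure_ascii=False matters for tasks with non-Latin answers (e.g. CMMMU),
    where escaping every character makes the file hard to inspect.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, file_name)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path


# Hypothetical usage with made-up paths and fields.
print(save_submission([{"question_id": 1, "answer": "猫"}],
                      "./logs/submissions", "cmmmu_test.json"))
```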

commit a0ce88c84a9122b793a6b6d352896767fed1f18a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit b892d8eac7f656fafa5d6425b94b3d089e4a5268
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit 74a747ff5e5a82cd8f61fb9f5a5357b67c867153
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 336de4a8408ece3c0a2b7b5880c00b38015674a1
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 5860f00373890a18ed09870757bcdae9f3821aa1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 912b73ed809e9242351874ce5b127c218188196d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit f3f98531fc18a053b1a1bdec6c03757e1334e93b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit ceccc944119c22177e7fe040ba73e468dcf6d419
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit d970b68e39068deb8308bb20af4266f4d37403df
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f0b9201adeb8e2e78886a6746ead6b585430f7d8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit f9cdc0331bf9ef3f1cca4a3791658b2f31f300ca
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit fb4bb090b185f18b8be4ef3353ec659a40e1b508
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 3d58243e32f551f5427950663157c2a5ce539504
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 95717b7ce70d40bc12e0b3b5809a686a083903aa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 07915d5ec5d68ed0cde34bbb6e0c1438757fab72
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit cc8ce2e48c31c5196ad5e0bca871acbe0c7492a1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 562bb6c15876164ad49392df1a66ed6af84cac76
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f2a585a4e5163b51dc31686a32a8aae7fd8e0751
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit e3896d1421b5ba5794db227648ca4316a0170569
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 90fbf3d
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0fa3bce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit 0182d5d
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
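
The "Add timeout for gpt4V", "Update sleep time in utils.py", and earlier "Fix gpt eval timeout issue for hallubench" items concern GPT-judged metrics stalling or hitting rate limits. A hedged sketch of the usual remedy, a bounded request timeout plus retry-with-sleep, is shown below; the endpoint, model name, payload fields, and retry counts are placeholders rather than the project's actual helper.

```python
import time

import requests

API_URL = "https://api.openai.com/v1/chat/completions"  # placeholder endpoint


def query_judge(prompt: str, api_key: str, retries: int = 5,
                timeout: int = 60) -> str:
    """Call a chat-completions style API with a timeout and simple backoff."""
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {
        "model": "gpt-4-vision-preview",  # placeholder judge model
        "messages": [{"role": "user", "content": prompt}],
    }
    last_err = None
    for attempt in range(retries):
        try:
            resp = requests.post(API_URL, headers=headers, json=payload,
                                 timeout=timeout)
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except requests.RequestException as err:
            last_err = err
            # Sleep a little longer after each failure before retrying.
            time.sleep(5 * (attempt + 1))
    raise RuntimeError(f"judge API failed after {retries} attempts: {last_err}")
```
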
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
…volvingLMMs-Lab#62)

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 9c0bc58
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 30ab0ce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit '5e73e8b8a2408bd8193361788669ca80db19cb04'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit '40099e8b8145bde513b9b7cef8461d8f13d1eafe'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit 'a56fe11c00ad4a8b8967be88b93baef6649528c5'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit a5b07ee
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit 1cf38b3ad6c7799957901d836299243cc21718f5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 62527c874431508b7731ad49ff1f1526104703cd
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 522f36aca8354f5efa7fff6d23bd90e885bcf1ab
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 4ee323a5b19382dbd9ba62f5002042d0746c374e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 3d3e164489cb4bd2db342ae085da9613ee7de660
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 8a4f586d7232a4d89977cef140900728d4517b72
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 33dd5b0e0006882e735b7ea1908fdb6ad37c825a
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f19de3e7aaf5151d5ce9c63a2b9ee393c6282dfa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit e1f8cad15ddc2e385a3f2a778a4af57e1072987c
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 472b6b1ed2d5bc10ff1d6df8e435f33dc821ad4b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 367c021bd50068baf024bea3afde4ed58aa38b81
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 0a466e16d983392cbf0580733500c0890521df93
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 6feceda2c1d631243c78fd7805dcdde4d0e8912f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit db1f731ee5aff4618edefed018e982f83add0c9a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c8a5e1129310ed1ce1fd86f43bb49da701140383
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit de53ceaeff08dc7c01962c704e06d7b87f804ec7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit e372631e911f2e03cc4f579e291e1198c4c11298
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 9c0bc58
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 30ab0ce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit a5b07ee
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit cf18d7a1300311ffe1c9671fff7fa0c0d1cf2476
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 35e5a937bcf924d6b787ce37c6da9f0f54674da9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 13179f9
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 9c0bc58
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 30ab0ce
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


    commit a5b07ee
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 39ce670fb1992c5e30d4b0eff9636a88a1ce83f5
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 36eeaa08730cd3e6a7e90e7000f61b4ebb075524
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 9ac7212
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 9c0bc58
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 30ab0ce
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


    commit a5b07ee
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 22fda28d8aa2a53405f15d179ea9baaf53a19b0b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit 48d92eb823b7929ea4c7b0da9f2284ec194c71cf
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit 1cf38b3ad6c7799957901d836299243cc21718f5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 62527c874431508b7731ad49ff1f1526104703cd
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 522f36aca8354f5efa7fff6d23bd90e885bcf1ab
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 4ee323a5b19382dbd9ba62f5002042d0746c374e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 3d3e164489cb4bd2db342ae085da9613ee7de660
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 8a4f586d7232a4d89977cef140900728d4517b72
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 33dd5b0e0006882e735b7ea1908fdb6ad37c825a
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f19de3e7aaf5151d5ce9c63a2b9ee393c6282dfa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit e1f8cad15ddc2e385a3f2a778a4af57e1072987c
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 472b6b1ed2d5bc10ff1d6df8e435f33dc821ad4b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 367c021bd50068baf024bea3afde4ed58aa38b81
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 0a466e16d983392cbf0580733500c0890521df93
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 6feceda2c1d631243c78fd7805dcdde4d0e8912f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit db1f731ee5aff4618edefed018e982f83add0c9a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c8a5e1129310ed1ce1fd86f43bb49da701140383
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit de53ceaeff08dc7c01962c704e06d7b87f804ec7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit e372631e911f2e03cc4f579e291e1198c4c11298
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 9c0bc58
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 30ab0ce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit '5e73e8b8a2408bd8193361788669ca80db19cb04'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit '40099e8b8145bde513b9b7cef8461d8f13d1eafe'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit 'a56fe11c00ad4a8b8967be88b93baef6649528c5'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava not handling text-only questions (no visuals)

    * Fix qwen not handling questions with no image (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * rename sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refactor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore loading from gpt to avoid re-evaluating

    * Refactor llava wild

    * Refactor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit a5b07ee
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
…volvingLMMs-Lab#62)

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit 6bb0667ea746cc1dfa9442882f517edd47694d3e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit e4ab9fc9ec7d77850ecc05bd33256909cdf62513
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 74c28de92a5794054d7c937b727fba3a8e5821c3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 279be1be1e2a839c97e58289362d6828e95e064a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 666f3146feef55f898f710254824d4b2c57e6747
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 1f8d04d20feb6363615537ab47f8a1241c4ee692
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 985194e49f519ce04bdc2c0ce00eee3ab6c02def
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit ef5a0a3b46acc36255c28781d8d66fc9bd32d47b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit e793fd1da7416d7938a6f9e98728692c04264a97
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit c1ae0a853bfdcc7d59e3d9fa0eaa78d4d1f01336
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 3ca0112d74b957f4d4ca20be5573deb8141793c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 821398fde93ccd52eac2f4bbfb8c2e787a10b987
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 5172c13fb3b212c0d175987727433320a1faacbc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 12a243c8bee0be6ffacf17e46143519734c310d5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 2aded15347d10078c49606b690d05935ad29e6d1
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9d499f198a9bdab2177bedfd3980c00934c684ff
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit c5431b5b80cbaf6e11d840ecb1d0734d680ac41b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit b3f1eff
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0f26c8a
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava not handling text-only questions (no visuals)

    * Fix qwen not handling questions with no image (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * rename sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refactor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore loading from gpt to avoid re-evaluating

    * Refactor llava wild

    * Refactor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit fefc964
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 084a21394643acd741fe0969dd0d3f6c6c734853
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 803d0aec82a57de2ddf1527044f14ed968c30e25
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit c5344f6
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 7009af6bc533534e249b3070f122d825ce738ba0
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 44b1e7fc5570130e64269c312c11fe0244c72c87
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 34476c7
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit be339f832f760190e81bbfbeffb7049f7cccee60
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit f301a5614054538cd7c18d3ac7b1f02305e68224
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
…volvingLMMs-Lab#62)

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit decb360fd834d968cc59dee6a06d40a326177ec5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 1e1ecf0de94b5e493ce0590269b3a2b9d030e31d
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit ade2b08994f0b92f20d373cbc3cc8e2a8b665f49
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 3bca65bca4b9b4cab80d50172dabda5c549c539f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 0ee12be56664eac6a79599b48ea22985f18ec358
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 62cb1058ac416027ad981e3ba31ce029dfe83cf3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 55447e7039321ed8d46c8dccaf75113288bdb502
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 26963ddb5315e39ad9142e0fa1391fe2b8201c54
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit 7abbd695dfa73e09687a4d4f73c6bc99e63c811a
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit d1b19061661c1da1d3b7e9cd5d126ec475b6e1de
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 2564e74c7e8c07a51200560be70d2be13501fd9a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 4885702fcd36cfaf5bf2e498621fa0a831e8617c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 46fc13424e6fecaa15d290f2330bc440ce9bd6e6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 7cbeb3a05fc13fa9d0d44a17a7cd25e7550c435b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit e8a88505cbd71029682eaaddc8fe2c5cd41ccf5d
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit efc341983b959fb2cc9cc208879a86a01c251494
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit 2b92f718f478b9f7999b17560439db366d2165a3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit fffe545
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit c608dd6
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'b636596c46dce543cdfacc0809c5b14edafcf1fd'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit '5624cd5b92ff6b1bc1d431a615d938fd623a03c4'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '034d73b022739333da5e60f432330b8ea832ef9b'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit a0959f1
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 4a1f385be0df3374ebf428599cfe35febdae0582
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 19f7d8cd771fddd6cc6c3fee8f3c51fa4ad83eaa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 1b605af
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit d1fffce8c61bd7e1e32f76c953c5b26773be58d5
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 5a4df5d39e813844002af1a02ef4ce0c69feaa6d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit b923ad1
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

commit 7f852ee91653357b6ee954ec92bcf2e5bab4bbcf
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit 79c737c915565b191ab29113c98615a1c6acc994
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
…volvingLMMs-Lab#62)

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit f6a7654
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 6dbf2a9
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit cbe3e52
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit 74a747ff5e5a82cd8f61fb9f5a5357b67c867153
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 336de4a8408ece3c0a2b7b5880c00b38015674a1
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 5860f00373890a18ed09870757bcdae9f3821aa1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 912b73ed809e9242351874ce5b127c218188196d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit f3f98531fc18a053b1a1bdec6c03757e1334e93b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit ceccc944119c22177e7fe040ba73e468dcf6d419
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit d970b68e39068deb8308bb20af4266f4d37403df
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f0b9201adeb8e2e78886a6746ead6b585430f7d8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit f9cdc0331bf9ef3f1cca4a3791658b2f31f300ca
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit fb4bb090b185f18b8be4ef3353ec659a40e1b508
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 3d58243e32f551f5427950663157c2a5ce539504
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 95717b7ce70d40bc12e0b3b5809a686a083903aa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 07915d5ec5d68ed0cde34bbb6e0c1438757fab72
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit cc8ce2e48c31c5196ad5e0bca871acbe0c7492a1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 562bb6c15876164ad49392df1a66ed6af84cac76
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f2a585a4e5163b51dc31686a32a8aae7fd8e0751
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit e3896d1421b5ba5794db227648ca4316a0170569
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit f6a7654
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 6dbf2a9
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit cbe3e52
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit d1d4ca79d569d5765080160bd8c7e8d432cadd99
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit d1815c3465e43a083ab811e8fc8602911a971413
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 27dbf48
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit f6a7654
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 6dbf2a9
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit cbe3e52
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit bf67bcc02cb57e63952e4429515269458084ea5f
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit c3e54461dd77f62aa50bcee8fbbebc14e4470644
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 2a94fb0
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit a0ce88c84a9122b793a6b6d352896767fed1f18a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit b892d8eac7f656fafa5d6425b94b3d089e4a5268
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit 74a747ff5e5a82cd8f61fb9f5a5357b67c867153
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 336de4a8408ece3c0a2b7b5880c00b38015674a1
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 5860f00373890a18ed09870757bcdae9f3821aa1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 912b73ed809e9242351874ce5b127c218188196d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit f3f98531fc18a053b1a1bdec6c03757e1334e93b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit ceccc944119c22177e7fe040ba73e468dcf6d419
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit d970b68e39068deb8308bb20af4266f4d37403df
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f0b9201adeb8e2e78886a6746ead6b585430f7d8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit f9cdc0331bf9ef3f1cca4a3791658b2f31f300ca
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit fb4bb090b185f18b8be4ef3353ec659a40e1b508
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 3d58243e32f551f5427950663157c2a5ce539504
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 95717b7ce70d40bc12e0b3b5809a686a083903aa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 07915d5ec5d68ed0cde34bbb6e0c1438757fab72
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit cc8ce2e48c31c5196ad5e0bca871acbe0c7492a1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 562bb6c15876164ad49392df1a66ed6af84cac76
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f2a585a4e5163b51dc31686a32a8aae7fd8e0751
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit e3896d1421b5ba5794db227648ca4316a0170569
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit f6a7654
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 6dbf2a9
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

commit cbe3e52
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request on Oct 6, 2024
…volvingLMMs-Lab#62)

* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 2782eb0
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 7e8d3e4
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit 4fa73ba
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit e873012d0da2711f2076f7c09f390901f89da2f9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 621cdd663e0197827a5792872f13cdf3d27d2813
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 6daf75c54fe3d45970c5d35a10000f10c1420c6b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 2a7a03205a2514fe0322ab4aa05c4948f9233109
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit a99850057224596d01835fface39d4aafd79de3e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 42f5fc125c7ee7d31633647f29f0d02ed3e640a8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit dddd0276003115c8a150a78eb3ae7bd299c460e4
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit bcffe0b45083f48886e18d5ece5f2504b96bbcbd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit f6705996b992363f2fd3c5dedb90e1bd51d04426
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9290fc1c27ecca86f7ec3df0d932c7fa228e19c9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 2fceaaf8f855d08d642996cd217ec0f6fc0fa04c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 33c0a81c91733e9aabe214f0797be2fdd3df1f1c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 90ad0ace136a35ecc16a09ce841736842f7eb6dd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 15b0336a932ef1823696e63672837700ce4fdae9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit f75e7cfd35b1ee814f86abb9d4fbace027c00941
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 06c51ea7682e31964ca720a8a40705a3a7f3f360
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit cdf7e6f77f7b6eee960e01e80c00ec74b8c1fbe7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 2782eb0
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 7e8d3e4
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit 4fa73ba
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit bf49a3e1de8431193bdf6f7688a4ff7f4683a84d
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit b535df91bc792b3b2b296572ec4692c75fdfe878
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit d0539a0
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 2782eb0
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 7e8d3e4
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


    commit 4fa73ba
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 7dc049915a1846177e0f9f8eab12366881f82157
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 5ec98efc7b666341adc726b8d1d4779b6c543f7f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 105d781
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 2782eb0
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 7e8d3e4
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


    commit 4fa73ba
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 8263ca91c87a127d992dd01bdac5f89b8a5ff521
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit c413569d46be0ad604cd249df8bd58ffe26c0e39
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit e873012d0da2711f2076f7c09f390901f89da2f9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 621cdd663e0197827a5792872f13cdf3d27d2813
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 6daf75c54fe3d45970c5d35a10000f10c1420c6b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 2a7a03205a2514fe0322ab4aa05c4948f9233109
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit a99850057224596d01835fface39d4aafd79de3e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 42f5fc125c7ee7d31633647f29f0d02ed3e640a8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit dddd0276003115c8a150a78eb3ae7bd299c460e4
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit bcffe0b45083f48886e18d5ece5f2504b96bbcbd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit f6705996b992363f2fd3c5dedb90e1bd51d04426
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9290fc1c27ecca86f7ec3df0d932c7fa228e19c9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 2fceaaf8f855d08d642996cd217ec0f6fc0fa04c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 33c0a81c91733e9aabe214f0797be2fdd3df1f1c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 90ad0ace136a35ecc16a09ce841736842f7eb6dd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 15b0336a932ef1823696e63672837700ce4fdae9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit f75e7cfd35b1ee814f86abb9d4fbace027c00941
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 06c51ea7682e31964ca720a8a40705a3a7f3f360
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit cdf7e6f77f7b6eee960e01e80c00ec74b8c1fbe7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 2782eb0
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 7e8d3e4
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit 4fa73ba
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request on Oct 6, 2024
* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 5c6e0c8
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 8bd568e
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit 0e0c698
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit 584db7fcc0140dd4a6d6481529ae90570b2912c4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 5e52a8df3785eb2d1b392eb164b66e92c9dadb02
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit a3cae8e9f3570121d51885c71f7081da36c5d13d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 0b3cad596fd58e6414ea015e79bef1eea6eb7f7a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit f436cb65bd716d93044516ece2133ab5b8d87137
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 3d47b59f92cef22cfe38e00b407ce38a61d538b2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit ffb9eb26dae25cda1e0d3e302852862102b47054
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 1700786b572cbedcb6969ae97028225d388987bb
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit 786f2b53d57265b9900b0718d27538221b5f81b4
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 888c1c128319bd04528727a309d0d92aaee9e752
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 8c74caa2f77940c781501b45571d7c6362c9a6c8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 4ab5cc32e3a460ad112dcd3031cea55b6bc0f691
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit eae08b536908875eeb600538e853caaa14c655ae
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 4240785c1bf3a7fd15f36013803c004542a17f2e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c15bf75d2f76e215b4d5de43c1d17b5a41d79753
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 93534dc4e98b78b9da01099079187d8960705fb8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit 05166a14c45063bf108282c3202d32feb2fe0afa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 5c6e0c8
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 8bd568e
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit '63fc8eee4dddfbe741e5a862e5ff30d19c34238e'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'd16bbce134d453c624834e090af1e0f869fdde15'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '7332704263a45ab6fa69aad0c4303cd9cbc26813'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0e0c698
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 172a002845728f263a9221206aeab62bdc1070dc
Merge: 2475639 2152f18
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 2475639fcf9164a7965b080c31dc50bc856fa053
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 2152f18
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 5c6e0c8
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 8bd568e
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit '63fc8eee4dddfbe741e5a862e5ff30d19c34238e'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit 'd16bbce134d453c624834e090af1e0f869fdde15'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit '7332704263a45ab6fa69aad0c4303cd9cbc26813'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 0e0c698
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 5902608191d5a8a059c2a267afc0100f47140fae
Merge: 83358a4 fd7773d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 83358a42354d8ec57d3d887e2262f82e7dd4c532
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit fd7773d
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 5c6e0c8
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 8bd568e
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit '63fc8eee4dddfbe741e5a862e5ff30d19c34238e'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit 'd16bbce134d453c624834e090af1e0f869fdde15'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit '7332704263a45ab6fa69aad0c4303cd9cbc26813'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 0e0c698
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit ce51924783fa5c50f99815a33988476ee1220bac
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit a288035b48620b827a82c1c45412fe2bb3c18715
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit 584db7fcc0140dd4a6d6481529ae90570b2912c4
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 5e52a8df3785eb2d1b392eb164b66e92c9dadb02
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit a3cae8e9f3570121d51885c71f7081da36c5d13d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 0b3cad596fd58e6414ea015e79bef1eea6eb7f7a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit f436cb65bd716d93044516ece2133ab5b8d87137
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 3d47b59f92cef22cfe38e00b407ce38a61d538b2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit ffb9eb26dae25cda1e0d3e302852862102b47054
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 1700786b572cbedcb6969ae97028225d388987bb
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit 786f2b53d57265b9900b0718d27538221b5f81b4
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 888c1c128319bd04528727a309d0d92aaee9e752
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 8c74caa2f77940c781501b45571d7c6362c9a6c8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 4ab5cc32e3a460ad112dcd3031cea55b6bc0f691
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit eae08b536908875eeb600538e853caaa14c655ae
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 4240785c1bf3a7fd15f36013803c004542a17f2e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c15bf75d2f76e215b4d5de43c1d17b5a41d79753
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 93534dc4e98b78b9da01099079187d8960705fb8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit 05166a14c45063bf108282c3202d32feb2fe0afa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 5c6e0c8
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 8bd568e
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None (see the sketch after this commit log)

    * Merge commit '63fc8eee4dddfbe741e5a862e5ff30d19c34238e'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'd16bbce134d453c624834e090af1e0f869fdde15'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '7332704263a45ab6fa69aad0c4303cd9cbc26813'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
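
A minimal sketch of the "display points as "  -  " if value is None" behavior mentioned in the log above; the real `make_table` in lmms_eval formats full result tables, so the toy `format_point` helper here is an assumption used only to illustrate the placeholder dash.

```python
# Illustration only: render a metric cell as a dash placeholder when the value is missing,
# mirroring the "display points as '  -  ' if value is None" commit message above.
def format_point(value):
    """Format one table cell; None becomes a padded dash instead of breaking the formatter."""
    return "  -  " if value is None else f"{value:.4f}"

row = [format_point(v) for v in (0.6132, None, 0.874)]
print(" | ".join(row))  # -> 0.6132 |   -   | 0.8740
```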

commit 0e0c698
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V (see the sketch after this change list)

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
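
The "Add timeout for gpt4V" and "Update sleep time in utils.py" items in the list above both concern the GPT-based judge calls. Below is a hedged sketch of that pattern, assuming an OpenAI-style endpoint; the URL, retry count, and sleep interval are placeholders rather than the project's actual utils.py values.

```python
# Sketch of a bounded, retried judge call ("Add timeout for gpt4V" / "Update sleep time"):
# each request gets a hard timeout, and transient failures back off before retrying.
import os
import time

import requests

API_URL = "https://api.openai.com/v1/chat/completions"  # placeholder endpoint
HEADERS = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY', '')}"}

def get_chat_response(payload, retries=5, timeout=60, sleep_time=10):
    """Query the judge model, returning an empty string if every attempt fails."""
    for attempt in range(retries):
        try:
            resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except requests.RequestException as err:
            print(f"Judge request attempt {attempt + 1} failed: {err}")
            time.sleep(sleep_time)  # back off before retrying instead of aborting the eval
    return ""
```
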
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 90fbf3d
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0fa3bce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0182d5d
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit 74a747ff5e5a82cd8f61fb9f5a5357b67c867153
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 336de4a8408ece3c0a2b7b5880c00b38015674a1
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 5860f00373890a18ed09870757bcdae9f3821aa1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 912b73ed809e9242351874ce5b127c218188196d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit f3f98531fc18a053b1a1bdec6c03757e1334e93b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit ceccc944119c22177e7fe040ba73e468dcf6d419
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit d970b68e39068deb8308bb20af4266f4d37403df
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f0b9201adeb8e2e78886a6746ead6b585430f7d8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit f9cdc0331bf9ef3f1cca4a3791658b2f31f300ca
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit fb4bb090b185f18b8be4ef3353ec659a40e1b508
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 3d58243e32f551f5427950663157c2a5ce539504
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 95717b7ce70d40bc12e0b3b5809a686a083903aa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 07915d5ec5d68ed0cde34bbb6e0c1438757fab72
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit cc8ce2e48c31c5196ad5e0bca871acbe0c7492a1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 562bb6c15876164ad49392df1a66ed6af84cac76
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f2a585a4e5163b51dc31686a32a8aae7fd8e0751
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit e3896d1421b5ba5794db227648ca4316a0170569
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 90fbf3d
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0fa3bce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0182d5d
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit d1d4ca79d569d5765080160bd8c7e8d432cadd99
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit d1815c3465e43a083ab811e8fc8602911a971413
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit b8b7f79
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 90fbf3d
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 0fa3bce
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 0182d5d
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit bf67bcc02cb57e63952e4429515269458084ea5f
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit c3e54461dd77f62aa50bcee8fbbebc14e4470644
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 09eecf5
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 90fbf3d
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 0fa3bce
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 0182d5d
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit a0ce88c84a9122b793a6b6d352896767fed1f18a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit b892d8eac7f656fafa5d6425b94b3d089e4a5268
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit 74a747ff5e5a82cd8f61fb9f5a5357b67c867153
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 336de4a8408ece3c0a2b7b5880c00b38015674a1
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 5860f00373890a18ed09870757bcdae9f3821aa1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 912b73ed809e9242351874ce5b127c218188196d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit f3f98531fc18a053b1a1bdec6c03757e1334e93b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit ceccc944119c22177e7fe040ba73e468dcf6d419
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit d970b68e39068deb8308bb20af4266f4d37403df
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f0b9201adeb8e2e78886a6746ead6b585430f7d8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit f9cdc0331bf9ef3f1cca4a3791658b2f31f300ca
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit fb4bb090b185f18b8be4ef3353ec659a40e1b508
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 3d58243e32f551f5427950663157c2a5ce539504
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 95717b7ce70d40bc12e0b3b5809a686a083903aa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 07915d5ec5d68ed0cde34bbb6e0c1438757fab72
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit cc8ce2e48c31c5196ad5e0bca871acbe0c7492a1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 562bb6c15876164ad49392df1a66ed6af84cac76
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f2a585a4e5163b51dc31686a32a8aae7fd8e0751
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit e3896d1421b5ba5794db227648ca4316a0170569
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 90fbf3d
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0fa3bce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit 0182d5d
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 9c0bc58
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)


commit 30ab0ce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit a5b07ee
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit 1cf38b3ad6c7799957901d836299243cc21718f5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 62527c874431508b7731ad49ff1f1526104703cd
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 522f36aca8354f5efa7fff6d23bd90e885bcf1ab
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 4ee323a5b19382dbd9ba62f5002042d0746c374e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 3d3e164489cb4bd2db342ae085da9613ee7de660
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 8a4f586d7232a4d89977cef140900728d4517b72
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 33dd5b0e0006882e735b7ea1908fdb6ad37c825a
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f19de3e7aaf5151d5ce9c63a2b9ee393c6282dfa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit e1f8cad15ddc2e385a3f2a778a4af57e1072987c
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 472b6b1ed2d5bc10ff1d6df8e435f33dc821ad4b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 367c021bd50068baf024bea3afde4ed58aa38b81
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 0a466e16d983392cbf0580733500c0890521df93
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 6feceda2c1d631243c78fd7805dcdde4d0e8912f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit db1f731ee5aff4618edefed018e982f83add0c9a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c8a5e1129310ed1ce1fd86f43bb49da701140383
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit de53ceaeff08dc7c01962c704e06d7b87f804ec7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit e372631e911f2e03cc4f579e291e1198c4c11298
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 9c0bc58
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)


commit 30ab0ce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit a5b07ee
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit cf18d7a1300311ffe1c9671fff7fa0c0d1cf2476
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 35e5a937bcf924d6b787ce37c6da9f0f54674da9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 13179f9
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Squashed commit of the following:

    commit 9c0bc58
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)


    commit 30ab0ce
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


    commit a5b07ee
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 39ce670fb1992c5e30d4b0eff9636a88a1ce83f5
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 36eeaa08730cd3e6a7e90e7000f61b4ebb075524
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 9ac7212
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Squashed commit of the following:

    commit 9c0bc58
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)


    commit 30ab0ce
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


    commit a5b07ee
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package


    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 22fda28d8aa2a53405f15d179ea9baaf53a19b0b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit 48d92eb823b7929ea4c7b0da9f2284ec194c71cf
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num


commit 9c0bc58
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)


commit 30ab0ce
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit a5b07ee
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* Squashed commit of the following:

commit b3f1eff
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)


commit 0f26c8a
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)


commit fefc964
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package


* Squashed commit of the following:

commit 6bb0667ea746cc1dfa9442882f517edd47694d3e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit e4ab9fc9ec7d77850ecc05bd33256909cdf62513
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 74c28de92a5794054d7c937b727fba3a8e5821c3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 279be1be1e2a839c97e58289362d6828e95e064a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 666f3146feef55f898f710254824d4b2c57e6747
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 1f8d04d20feb6363615537ab47f8a1241c4ee692
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 985194e49f519ce04bdc2c0ce00eee3ab6c02def
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit ef5a0a3b46acc36255c28781d8d66fc9bd32d47b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit e793fd1da7416d7938a6f9e98728692c04264a97
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit c1ae0a853bfdcc7d59e3d9fa0eaa78d4d1f01336
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 3ca0112d74b957f4d4ca20be5573deb8141793c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 821398fde93ccd52eac2f4bbfb8c2e787a10b987
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 5172c13fb3b212c0d175987727433320a1faacbc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 12a243c8bee0be6ffacf17e46143519734c310d5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 2aded15347d10078c49606b690d05935ad29e6d1
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9d499f198a9bdab2177bedfd3980c00934c684ff
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit c5431b5b80cbaf6e11d840ecb1d0734d680ac41b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit b3f1eff
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0f26c8a
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

commit fefc964
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 084a21394643acd741fe0969dd0d3f6c6c734853
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 803d0aec82a57de2ddf1527044f14ed968c30e25
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit c5344f6
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit b3f1eff
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 0f26c8a
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    commit fefc964
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 7009af6bc533534e249b3070f122d825ce738ba0
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 44b1e7fc5570130e64269c312c11fe0244c72c87
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 34476c7
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit b3f1eff
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 0f26c8a
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    commit fefc964
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit be339f832f760190e81bbfbeffb7049f7cccee60
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit f301a5614054538cd7c18d3ac7b1f02305e68224
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit 6bb0667ea746cc1dfa9442882f517edd47694d3e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit e4ab9fc9ec7d77850ecc05bd33256909cdf62513
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 74c28de92a5794054d7c937b727fba3a8e5821c3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 279be1be1e2a839c97e58289362d6828e95e064a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 666f3146feef55f898f710254824d4b2c57e6747
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 1f8d04d20feb6363615537ab47f8a1241c4ee692
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 985194e49f519ce04bdc2c0ce00eee3ab6c02def
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit ef5a0a3b46acc36255c28781d8d66fc9bd32d47b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit e793fd1da7416d7938a6f9e98728692c04264a97
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit c1ae0a853bfdcc7d59e3d9fa0eaa78d4d1f01336
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 3ca0112d74b957f4d4ca20be5573deb8141793c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 821398fde93ccd52eac2f4bbfb8c2e787a10b987
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 5172c13fb3b212c0d175987727433320a1faacbc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 12a243c8bee0be6ffacf17e46143519734c310d5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit 2aded15347d10078c49606b690d05935ad29e6d1
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9d499f198a9bdab2177bedfd3980c00934c684ff
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit c5431b5b80cbaf6e11d840ecb1d0734d680ac41b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit b3f1eff
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 0f26c8a
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

commit fefc964
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit fffe545
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit c608dd6
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

commit a0959f1
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit decb360fd834d968cc59dee6a06d40a326177ec5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 1e1ecf0de94b5e493ce0590269b3a2b9d030e31d
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit ade2b08994f0b92f20d373cbc3cc8e2a8b665f49
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 3bca65bca4b9b4cab80d50172dabda5c549c539f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 0ee12be56664eac6a79599b48ea22985f18ec358
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 62cb1058ac416027ad981e3ba31ce029dfe83cf3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 55447e7039321ed8d46c8dccaf75113288bdb502
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 26963ddb5315e39ad9142e0fa1391fe2b8201c54
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit 7abbd695dfa73e09687a4d4f73c6bc99e63c811a
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit d1b19061661c1da1d3b7e9cd5d126ec475b6e1de
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 2564e74c7e8c07a51200560be70d2be13501fd9a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 4885702fcd36cfaf5bf2e498621fa0a831e8617c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 46fc13424e6fecaa15d290f2330bc440ce9bd6e6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 7cbeb3a05fc13fa9d0d44a17a7cd25e7550c435b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit e8a88505cbd71029682eaaddc8fe2c5cd41ccf5d
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit efc341983b959fb2cc9cc208879a86a01c251494
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit 2b92f718f478b9f7999b17560439db366d2165a3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit fffe545
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit c608dd6
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

commit a0959f1
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit 4a1f385be0df3374ebf428599cfe35febdae0582
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 19f7d8cd771fddd6cc6c3fee8f3c51fa4ad83eaa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 1b605af
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit fffe545
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit c608dd6
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    commit a0959f1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit d1fffce8c61bd7e1e32f76c953c5b26773be58d5
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 5a4df5d39e813844002af1a02ef4ce0c69feaa6d
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit b923ad1
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit fffe545
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit c608dd6
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    commit a0959f1
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 7f852ee91653357b6ee954ec92bcf2e5bab4bbcf
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit 79c737c915565b191ab29113c98615a1c6acc994
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit decb360fd834d968cc59dee6a06d40a326177ec5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 1e1ecf0de94b5e493ce0590269b3a2b9d030e31d
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit ade2b08994f0b92f20d373cbc3cc8e2a8b665f49
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 3bca65bca4b9b4cab80d50172dabda5c549c539f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 0ee12be56664eac6a79599b48ea22985f18ec358
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 62cb1058ac416027ad981e3ba31ce029dfe83cf3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 55447e7039321ed8d46c8dccaf75113288bdb502
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 26963ddb5315e39ad9142e0fa1391fe2b8201c54
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit 7abbd695dfa73e09687a4d4f73c6bc99e63c811a
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit d1b19061661c1da1d3b7e9cd5d126ec475b6e1de
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 2564e74c7e8c07a51200560be70d2be13501fd9a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 4885702fcd36cfaf5bf2e498621fa0a831e8617c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 46fc13424e6fecaa15d290f2330bc440ce9bd6e6
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 7cbeb3a05fc13fa9d0d44a17a7cd25e7550c435b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit e8a88505cbd71029682eaaddc8fe2c5cd41ccf5d
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit efc341983b959fb2cc9cc208879a86a01c251494
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit 2b92f718f478b9f7999b17560439db366d2165a3
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit fffe545
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit c608dd6
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

commit a0959f1
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit a68962a
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

commit 0b02105
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

commit f4af7d0
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit 1cf38b3ad6c7799957901d836299243cc21718f5
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 62527c874431508b7731ad49ff1f1526104703cd
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 522f36aca8354f5efa7fff6d23bd90e885bcf1ab
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 4ee323a5b19382dbd9ba62f5002042d0746c374e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit 3d3e164489cb4bd2db342ae085da9613ee7de660
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 8a4f586d7232a4d89977cef140900728d4517b72
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit 33dd5b0e0006882e735b7ea1908fdb6ad37c825a
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit f19de3e7aaf5151d5ce9c63a2b9ee393c6282dfa
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit e1f8cad15ddc2e385a3f2a778a4af57e1072987c
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 472b6b1ed2d5bc10ff1d6df8e435f33dc821ad4b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 367c021bd50068baf024bea3afde4ed58aa38b81
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 0a466e16d983392cbf0580733500c0890521df93
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 6feceda2c1d631243c78fd7805dcdde4d0e8912f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit db1f731ee5aff4618edefed018e982f83add0c9a
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit c8a5e1129310ed1ce1fd86f43bb49da701140383
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit de53ceaeff08dc7c01962c704e06d7b87f804ec7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit e372631e911f2e03cc4f579e291e1198c4c11298
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit cf18d7a1300311ffe1c9671fff7fa0c0d1cf2476
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit 35e5a937bcf924d6b787ce37c6da9f0f54674da9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit 3c741a5
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

commit 39ce670fb1992c5e30d4b0eff9636a88a1ce83f5
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 36eeaa08730cd3e6a7e90e7000f61b4ebb075524
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 9eb42de
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

commit 22fda28d8aa2a53405f15d179ea9baaf53a19b0b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit 48d92eb823b7929ea4c7b0da9f2284ec194c71cf
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
kangreen0210 pushed a commit to kangreen0210/LIME that referenced this pull request Oct 6, 2024
* Refactor logging in lmms_eval package

* Refactor variable names in lmms_eval package

* Update README.md with new features and installation instructions

* Update supported models and datasets

* Delete otter.py file

* Fix capitalization in README.md

* Update image sizes and add new features

* Refactor README.md to improve readability and add new features

* Add description for lmms-eval in README.md

* Update accelerator support in README.md

* Update lmms-eval README with improved description and additional features

* Update README.md with improved task grouping description

* change `Otter-AI/MME` to `lmms-lab/MME`

* Update README.md

* Update README.md

* Remove unused code in mme.yaml

* Squashed commit of the following:

commit 2782eb0
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

commit 7e8d3e4
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

commit 4fa73ba
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* add llava main in pyproject

* Update README.md

* Remove unnecessary dependencies and add specific version for llava_repr

* Add dependencies for llava_repr***

* Update README.md

* add some docs on models and command line commands

* remove some lines

* typo

* Update model_guide.md

* Update model_guide.md

* Update README.md

* Update README.md

* Update README.md

* Fix refcocog dataset path

* Record gpt response in eval info

* Resolve conflict

* Fix hallusionbench gpt json saving path

* Rename hallubench gpt output path

* Change remove image to check by type instead of check by names

* More robust check by type

* Add timeout to API requests

* Remove unnecessary img in data

* Forcing an empty commit.

* Testing

* Delete unnecessary things

* Fix error logging in get_chat_response function

* Fix seedbench2 image issue in doc_to_text

* Add conditional exclude for internal eval

* Squashed commit of the following:

commit e873012d0da2711f2076f7c09f390901f89da2f9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 621cdd663e0197827a5792872f13cdf3d27d2813
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 6daf75c54fe3d45970c5d35a10000f10c1420c6b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 2a7a03205a2514fe0322ab4aa05c4948f9233109
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit a99850057224596d01835fface39d4aafd79de3e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 42f5fc125c7ee7d31633647f29f0d02ed3e640a8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit dddd0276003115c8a150a78eb3ae7bd299c460e4
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit bcffe0b45083f48886e18d5ece5f2504b96bbcbd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit f6705996b992363f2fd3c5dedb90e1bd51d04426
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9290fc1c27ecca86f7ec3df0d932c7fa228e19c9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 2fceaaf8f855d08d642996cd217ec0f6fc0fa04c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 33c0a81c91733e9aabe214f0797be2fdd3df1f1c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 90ad0ace136a35ecc16a09ce841736842f7eb6dd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 15b0336a932ef1823696e63672837700ce4fdae9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit f75e7cfd35b1ee814f86abb9d4fbace027c00941
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 06c51ea7682e31964ca720a8a40705a3a7f3f360
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit cdf7e6f77f7b6eee960e01e80c00ec74b8c1fbe7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

* Fix small bugs in list_with_num

* Revise list_with_num model args

* Dev/readme rm rolling (EvolvingLMMs-Lab#60)

* remove log_likelyhood_rolling

* Update time efficiency benchmark in README.md

* add task guide

---------

Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Remove unnecessary code and update dependencies

* Fix logging utils bug on wandb grouping

* Add reproduce envs

* Squashed commit of the following:

commit bf49a3e1de8431193bdf6f7688a4ff7f4683a84d
Merge: 2475639 f89a736
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:12:12 2024 +0800

    Merge branch 'main' into kc/final_fix

commit b535df91bc792b3b2b296572ec4692c75fdfe878
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 22:11:04 2024 +0800

    Add reproduce envs

commit d0539a0
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 21:19:15 2024 +0800

    [Fix] wandb group logging missing columns (EvolvingLMMs-Lab#61)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 2782eb0
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 7e8d3e4
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 4fa73ba
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    * Fix logging utils bug on wandb grouping

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 7dc049915a1846177e0f9f8eab12366881f82157
Merge: 83358a4 5e1c9c7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:25:48 2024 +0000

    Merge branch 'main' into kc/final_fix

commit 5ec98efc7b666341adc726b8d1d4779b6c543f7f
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sun Mar 3 07:23:19 2024 +0000

    Fix logging utils bug on wandb grouping

commit 105d781
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Mar 3 13:01:11 2024 +0800

    [Fix] refcocog dataset path, record gpt prompt in internal eval, build context issue (EvolvingLMMs-Lab#59)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

    * Update README.md with new features and installation instructions

    * Update supported models and datasets

    * Delete otter.py file

    * Fix capitalization in README.md

    * Update image sizes and add new features

    * Refactor README.md to improve readability and add new features

    * Add description for lmms-eval in README.md

    * Update accelerator support in README.md

    * Update lmms-eval README with improved description and additional features

    * Update README.md with improved task grouping description

    * change `Otter-AI/MME` to `lmms-lab/MME`

    * Update README.md

    * Update README.md

    * Remove unused code in mme.yaml

    * Squashed commit of the following:

    commit 2782eb0
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Thu Feb 29 13:40:02 2024 +0800

        Dev/py add models (EvolvingLMMs-Lab#57)

        * add instructblip

        * minicpm_v

        * remove <image> from qwen-vl

        * speed up postprocessing

        * Optimize build context speed

        ---------

        Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 7e8d3e4
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Wed Feb 28 14:49:07 2024 +0800

        Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

        * refactor vizwizvqa task

        * Delete vqav2_test and vqav2_val YAML files

        * Refactor vqav2_process_results functions

        * Add a pack for vqav2

        * refactor okvqa

        * roll back vizwiz_vqa

        * Fix exact_match calculation in ok_vqa_process_results

        * Update OKVQA dataset name in readme

        * add model_specific_prompt_kwargs

        * add model_specific_prompt_kwargs to vizwiz_vqa

        * add model_specific_prompt_kwargs for vqav2

        * lint

        * fix a small bug for eval_logger

        * Refactor make_table function to display points as "  -  " if value is None

        * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

        * Refactor ok_vqa_aggreate_submissions function

        * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

        * Refactor VQA submission file saving

        * Update file utils

        * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

        * Refactor file path handling and submission generation

        * OKVQA path

        * vizwizvqa file

        * pack cmmmu

        * fix a small metric bug for cmmmu

        * Add higher_is_better flag to submission metric

        * Add CMMMU dataset to README.md

        * Add logging and refactor submission file generation in docvqa utils.py

        * pack docvqa

        * add traceback to print detailed error

        * Refactor docvqa_test_aggregate_results to accept additional arguments

        * Add metric check in evaluator.py and update test.yaml and val.yaml

        * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

        * merge textvqa

        * textvqa

        * Modify submission file generation for COCO test results

        * Update test result storage path

        * update coco cap file name

        * Update COCO 2017 Caption dataset name

        * ferret

        * Add Ferret dataset

        * Refactor hb_doc_to_text function to include model-specific prompts

        * Add IconQA and its subtasks

        * Refactor image list creation in doc_to_visual function

        * Add process_results function to default template

        * Update process_results function in iconqa utils.py

        * refactor flickr30k

        * change aggregation function

        * Fix formatting issues and update logging message

        * Fix llava can not handle only text question (no visuals)

        * Fix qwen can not handle no image question (no visuals)

        * Add fuyu prepare accelerator scripts

        * refactor mme

        * naming consistency

        * aggregation_submissions consistency

        * flickr30k naming consistency

        * remove submissions for mme

        * remove unused submission function

        * Refactor infovqa_test.yaml and infovqa_val.yaml

        * Refactor code for improved readability and maintainability

        * stvqa

        * remane sqa

        * Update lmms_eval textcaps files and utils.py

        * Update default prompt for text captions

        * Refactor textcaps_aggregation_result function

        * Add generate_submission_file function and update mathvista_aggregate_results signature

        * Update nocaps_test.yaml and nocaps_val.yaml

        * refractor internal_eval

        * Add internal evaluation datasets

        * pack multidocvqa

        * mmvet

        * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

        * Refractor llava wild

        * Refractor llava-bench-coco

        * Add JSON file generation for gpt evaluation details

        * mmmu

        * Remove MMBench English and Chinese tasks

        * Remove unnecessary return statement in mmbench_aggregate_test_results function

        * Fix distributed process group initialization

        * Update dataset paths and group names in mmbench test configs

        * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

        * Add torch module import

        * lint

        * Remove IconQA dataset from README.md

        * Add Multi-DocVQA and its submodules

        * Add new datasets and update task names

        * Refactor flickr_aggregation_result function to accept additional arguments

        * Add timeout kwargs in Accelerator constructor

        * Add encoding to be utf-8 for cmmmu

        * Fix llava try and catch, remove torch.distributed.init in main

        * Ds prepare script for llava

        ---------

        Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
        Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

    commit 4fa73ba
    Author: Li Bo <drluodian@gmail.com>
    Date:   Tue Feb 27 22:52:07 2024 +0800

        [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

        * Refactor logging in lmms_eval package

        * Refactor variable names in lmms_eval package

    * add llava main in pyproject

    * Update README.md

    * Remove unnecessary dependencies and add specific version for llava_repr

    * Add dependencies for llava_repr***

    * Update README.md

    * add some docs on models and command line commands

    * remove some lines

    * typo

    * Update model_guide.md

    * Update model_guide.md

    * Update README.md

    * Update README.md

    * Update README.md

    * Fix refcocog dataset path

    * Record gpt response in eval info

    * Resolve conflict

    * Fix hallusionbench gpt json saving path

    * Rename hallubench gpt output path

    * Change remove image to check by type instead of check by names

    * More robust check by type

    * Remove unnecessary img in data

    * Forcing an empty commit.

    * Testing

    * Delete unnecessary things

    * Fix seedbench2 image issue in doc_to_text

    * Add conditional exclude for internal eval

    * Fix small bugs in list_with_num

    * Revise list_with_num model args

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: jzhang38 <a1286225768@gmail.com>

commit 8263ca91c87a127d992dd01bdac5f89b8a5ff521
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:58:08 2024 +0000

    Revise list_with_num model args

commit c413569d46be0ad604cd249df8bd58ffe26c0e39
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 05:09:15 2024 +0000

    Fix small bugs in list_with_num

commit e873012d0da2711f2076f7c09f390901f89da2f9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:49:36 2024 +0000

    Add conditional exclude for internal eval

commit 621cdd663e0197827a5792872f13cdf3d27d2813
Merge: a3cae8e ffb9eb2
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 03:24:29 2024 +0000

    Merge branch 'dev/readme' into kc/final_fix

commit 6daf75c54fe3d45970c5d35a10000f10c1420c6b
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Sat Mar 2 02:47:31 2024 +0000

    Fix seedbench2 image issue in doc_to_text

commit 2a7a03205a2514fe0322ab4aa05c4948f9233109
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:32:49 2024 +0000

    Delete unnecessary things

commit a99850057224596d01835fface39d4aafd79de3e
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:31:42 2024 +0000

    Testing

commit 42f5fc125c7ee7d31633647f29f0d02ed3e640a8
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:29:30 2024 +0000

    Forcing an empty commit.

commit dddd0276003115c8a150a78eb3ae7bd299c460e4
Merge: 786f2b5 1700786
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:56 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit bcffe0b45083f48886e18d5ece5f2504b96bbcbd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 15:24:20 2024 +0000

    Remove unnecessary img in data

commit f6705996b992363f2fd3c5dedb90e1bd51d04426
Merge: 4240785 888c1c1
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:41:24 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 9290fc1c27ecca86f7ec3df0d932c7fa228e19c9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:40:51 2024 +0000

    More robust check by type

commit 2fceaaf8f855d08d642996cd217ec0f6fc0fa04c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 13:00:57 2024 +0000

    Change remove image to check by type instead of check by names

commit 33c0a81c91733e9aabe214f0797be2fdd3df1f1c
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 12:33:02 2024 +0000

    Rename hallubench gpt output path

commit 90ad0ace136a35ecc16a09ce841736842f7eb6dd
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 09:32:52 2024 +0000

    Fix hallusionbench gpt json saving path

commit 15b0336a932ef1823696e63672837700ce4fdae9
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:51:13 2024 +0000

    Resolve conflict

commit f75e7cfd35b1ee814f86abb9d4fbace027c00941
Merge: 9cf86fa 93534dc
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 08:37:21 2024 +0000

    Merge branch 'kc/final_fix' into dev/readme

commit 06c51ea7682e31964ca720a8a40705a3a7f3f360
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:55:03 2024 +0000

    Record gpt response in eval info

commit cdf7e6f77f7b6eee960e01e80c00ec74b8c1fbe7
Author: kcz358 <kaichenzhang358@outlook.com>
Date:   Fri Mar 1 07:49:01 2024 +0000

    Fix refcocog dataset path

commit 2782eb0
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Feb 29 13:40:02 2024 +0800

    Dev/py add models (EvolvingLMMs-Lab#57)

    * add instructblip

    * minicpm_v

    * remove <image> from qwen-vl

    * speed up postprocessing

    * Optimize build context speed

    ---------

    Co-authored-by: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 7e8d3e4
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Wed Feb 28 14:49:07 2024 +0800

    Pufanyi/flickr30k refractor (EvolvingLMMs-Lab#56)

    * refactor vizwizvqa task

    * Delete vqav2_test and vqav2_val YAML files

    * Refactor vqav2_process_results functions

    * Add a pack for vqav2

    * refactor okvqa

    * roll back vizwiz_vqa

    * Fix exact_match calculation in ok_vqa_process_results

    * Update OKVQA dataset name in readme

    * add model_specific_prompt_kwargs

    * add model_specific_prompt_kwargs to vizwiz_vqa

    * add model_specific_prompt_kwargs for vqav2

    * lint

    * fix a small bug for eval_logger

    * Refactor make_table function to display points as "  -  " if value is None

    * Merge commit 'c5e52a785d3cc87a866be9b880deb477d9f73fb7'

    * Refactor ok_vqa_aggreate_submissions function

    * Merge commit 'e5aa0a9601d6d8ce727315e4b0a8f13f06f26bff'

    * Refactor VQA submission file saving

    * Update file utils

    * Merge commit '560deca9f72483ca091795d6dc2537d4c54b32b0'

    * Refactor file path handling and submission generation

    * OKVQA path

    * vizwizvqa file

    * pack cmmmu

    * fix a small metric bug for cmmmu

    * Add higher_is_better flag to submission metric

    * Add CMMMU dataset to README.md

    * Add logging and refactor submission file generation in docvqa utils.py

    * pack docvqa

    * add traceback to print detailed error

    * Refactor docvqa_test_aggregate_results to accept additional arguments

    * Add metric check in evaluator.py and update test.yaml and val.yaml

    * add common `EvalAIAnswerProcessor` for okvqa, textvqa, vizwizvqa and vqav2

    * merge textvqa

    * textvqa

    * Modify submission file generation for COCO test results

    * Update test result storage path

    * update coco cap file name

    * Update COCO 2017 Caption dataset name

    * ferret

    * Add Ferret dataset

    * Refactor hb_doc_to_text function to include model-specific prompts

    * Add IconQA and its subtasks

    * Refactor image list creation in doc_to_visual function

    * Add process_results function to default template

    * Update process_results function in iconqa utils.py

    * refactor flickr30k

    * change aggregation function

    * Fix formatting issues and update logging message

    * Fix llava can not handle only text question (no visuals)

    * Fix qwen can not handle no image question (no visuals)

    * Add fuyu prepare accelerator scripts

    * refactor mme

    * naming consistency

    * aggregation_submissions consistency

    * flickr30k naming consistency

    * remove submissions for mme

    * remove unused submission function

    * Refactor infovqa_test.yaml and infovqa_val.yaml

    * Refactor code for improved readability and maintainability

    * stvqa

    * remane sqa

    * Update lmms_eval textcaps files and utils.py

    * Update default prompt for text captions

    * Refactor textcaps_aggregation_result function

    * Add generate_submission_file function and update mathvista_aggregate_results signature

    * Update nocaps_test.yaml and nocaps_val.yaml

    * refractor internal_eval

    * Add internal evaluation datasets

    * pack multidocvqa

    * mmvet

    * Fix gpt eval timeout issue for hallubench, restore load from gpt to avoid re evaluating

    * Refractor llava wild

    * Refractor llava-bench-coco

    * Add JSON file generation for gpt evaluation details

    * mmmu

    * Remove MMBench English and Chinese tasks

    * Remove unnecessary return statement in mmbench_aggregate_test_results function

    * Fix distributed process group initialization

    * Update dataset paths and group names in mmbench test configs

    * Update import statements in cc_utils.py, cn_utils.py, and en_utils.py

    * Add torch module import

    * lint

    * Remove IconQA dataset from README.md

    * Add Multi-DocVQA and its submodules

    * Add new datasets and update task names

    * Refactor flickr_aggregation_result function to accept additional arguments

    * Add timeout kwargs in Accelerator constructor

    * Add encoding to be utf-8 for cmmmu

    * Fix llava try and catch, remove torch.distributed.init in main

    * Ds prepare script for llava

    ---------

    Co-authored-by: JvThunder <joshuaadrianc@gmail.com>
    Co-authored-by: kcz358 <kaichenzhang358@outlook.com>

commit 4fa73ba
Author: Li Bo <drluodian@gmail.com>
Date:   Tue Feb 27 22:52:07 2024 +0800

    [Wandb Logger] add models, and args to wandb tables. (EvolvingLMMs-Lab#55)

    * Refactor logging in lmms_eval package

    * Refactor variable names in lmms_eval package

* Update commands.md

* Add repr_scripts for reference

* Add timeout for gpt4V

* Remove unnecessary dependencies

* Add reproduce into readme

* Revise seedbench process_result

* Fix exclude dc hardcode postprocess logic error

* Fix metric repeat issue

* Update dataset runtime and add environment info

* Revise val submission file saving path

* Put the correct query into the gpt extraction

* Update sleep time in utils.py

* update

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: kcz358 <kaichenzhang358@outlook.com>
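
For readers skimming the log above, the recurring "Add timeout kwargs in Accelerator constructor" entry refers to the kind of fix sketched below. This is a minimal, hedged illustration using the Hugging Face accelerate API, not the repository's exact code; the one-hour timeout value is an arbitrary placeholder.

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the process-group timeout so long-running GPT-based evaluation
# steps do not trip the default 30-minute collective timeout.
# The one-hour value is a placeholder, not the project's actual setting.
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=1))
accelerator = Accelerator(kwargs_handlers=[init_kwargs])
```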