[VLM] Accuracy Evaluation #2393

Merged

hanyunfan merged 35 commits into mlcommons:master from CentML:jcalderon/vlm-accuracy-eval

Dec 2, 2025

Contributor

johncalesp commented Nov 19, 2025

This PR proposes a way to evaluate the accuracy of the VLM.
Notes:

Added min.query.count to the Task, because LoadGen currently sends a number of requests equal to the total number of records in the dataset, and we need a knob to control how many requests are sent.
Added the Evaluator class to run the evaluation. The method calculate_exact_match was kept in case we need it during development; it may be deleted later.
Added extra dependencies.

As of now, running 1k samples, I get:

╒═══════════════╤════════════╕
│ Fields        │   F1 Score │
╞═══════════════╪════════════╡
│ category      │   0.777359 │
├───────────────┼────────────┤
│ is_secondhand │   0.105263 │
╘═══════════════╧════════════╛
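
For readers unfamiliar with the metric, per-field F1 scores like the ones in the table above can be computed roughly as follows. This is a minimal sketch, not the PR's Evaluator: the function names, the toy records, and the choice of macro averaging are all assumptions made for illustration. A later commit in this thread mentions hiclass, so the real implementation may use a hierarchical variant for category.

```python
from collections import Counter


def f1_per_label(y_true, y_pred):
    """One-vs-rest F1 for every label that appears in truth or prediction."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted label was wrong
            fn[truth] += 1  # the true label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 (the averaging is an assumption)."""
    scores = f1_per_label(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical toy records; NOT taken from the benchmark dataset.
truth = [
    {"category": "shoes", "is_secondhand": True},
    {"category": "shoes", "is_secondhand": False},
    {"category": "bags", "is_secondhand": False},
    {"category": "bags", "is_secondhand": False},
]
preds = [
    {"category": "shoes", "is_secondhand": False},
    {"category": "bags", "is_secondhand": False},
    {"category": "bags", "is_secondhand": True},
    {"category": "bags", "is_secondhand": False},
]

# One score per evaluated field, mirroring the layout of the table.
field_scores = {
    field: macro_f1([r[field] for r in truth], [r[field] for r in preds])
    for field in ("category", "is_secondhand")
}
print(field_scores)
```

In this toy data the single second-hand item is missed, so the positive class contributes an F1 of zero and pulls the field's score down even though most predictions are correct.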


          Initial proposal for VLM - evaluation

35c9704

johncalesp requested a review from a team as a code owner

November 19, 2025 16:23

Contributor

github-actions bot commented Nov 19, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

wangshangsam suggested changes

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/cli.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

Contributor

wangshangsam commented Nov 20, 2025

As it stands, is_secondhand seems quite low. Did you get this score after including the product description or before? Which model did you use for this?
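
One way to put a low score on a binary field in context is to compare it with what uniform random guessing would achieve. The sketch below is only illustrative: it assumes the reported number is the F1 of the positive class, and it uses a made-up positive rate, since the real share of second-hand items in the dataset is not stated in this thread. The function names are hypothetical, and the baseline added later in this PR may be computed differently.

```python
import random


def binary_f1(y_true, y_pred):
    """F1 of the positive (True) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def random_baseline_f1(positive_rate):
    """Approximate F1 of coin-flip predictions.

    A coin flip is right about a predicted positive with probability equal
    to the positive rate (precision) and catches half of the true
    positives (recall = 0.5).
    """
    precision, recall = positive_rate, 0.5
    return 2 * precision * recall / (precision + recall)


def simulated_baseline_f1(positive_rate, n, seed=0):
    """Empirical check of the approximation with a seeded simulation."""
    rng = random.Random(seed)
    y_true = [rng.random() < positive_rate for _ in range(n)]
    y_pred = [rng.random() < 0.5 for _ in range(n)]
    return binary_f1(y_true, y_pred)


# HYPOTHETICAL positive rate, chosen only for illustration.
p = 0.10
print(random_baseline_f1(p), simulated_baseline_f1(p, n=100_000))
```

With a positive rate of p, the coin-flip baseline is roughly p / (p + 0.5), so a rare positive class yields a low F1 even for a predictor with no skill. Comparing the model's score against that figure shows how much signal it actually adds.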

johncalesp and others added 2 commits

November 20, 2025 14:52


          address review comments and test hiclass implementation

f5995c3


          [Automated Commit] Format Codebase

46e4fc0

Contributor Author

johncalesp commented Nov 20, 2025

The results were after including product description, and the model that I'm using is Qwen/Qwen2.5-VL-32B-Instruct

wangshangsam suggested changes

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/cli.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py (outdated, resolved)

Contributor

wangshangsam commented Nov 21, 2025

The results were after including product description, and the model that I'm using is Qwen/Qwen2.5-VL-32B-Instruct

We should try this on the actual Qwen/Qwen3-VL-235B-A22B-Instruct and Qwen/Qwen3-VL-235B-A22B-Thinking models, but that can wait until after this PR is merged.

johncalesp and others added 2 commits

November 22, 2025 21:07


          additional fixes to reviews

0dc13dc


          [Automated Commit] Format Codebase

4aa5e0d

wangshangsam suggested changes

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/cli.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/cli.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/cli.py (outdated, resolved)

johncalesp and others added 3 commits

November 24, 2025 11:00


          address PR comments

5e1590b


          [Automated Commit] Format Codebase

e1ccc85


          add a more detail description of the field dataset.split

7e0c444

wangshangsam approved these changes

Contributor

wangshangsam left a comment

LGTM! Thanks a lot, John!

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

Contributor

wangshangsam commented Nov 24, 2025

This PR seems ready. @hanyunfan @mrmhodak @arjunsuresh I'm wondering if you could take a look, approve and merge it? Thanks!

wangshangsam and others added 4 commits

November 25, 2025 05:12


          Enable exception logging in _query_endpoint_async

b35b057


          [Automated Commit] Format Codebase

48b5bdb


          Merge branch 'master' into jcalderon/vlm-accuracy-eval

0e4c5ee


          [Automated Commit] Format Codebase

f8e1498

mrmhodak previously approved these changes


          Trigger CI/CD pipeline

f464499

Contributor

wangshangsam commented Nov 25, 2025

@arjunsuresh Could you merge this PR? Thanks a lot!

wangshangsam added 3 commits

November 26, 2025 13:27


          Merge branch 'master' into jcalderon/vlm-accuracy-eval

9609cd0


          Add performance_sample_count_override as a CLI flag.

bc56ec9


          Merge branch 'jcalderon/vlm-accuracy-eval' of github.com:CentML/mlperf-inference into jcalderon/vlm-accuracy-eval

b8e2909

wangshangsam dismissed mrmhodak’s stale review via

b8e2909

November 26, 2025 19:33


          [Automated Commit] Format Codebase

8b43239

wangshangsam and others added 3 commits

November 26, 2025 14:34


          Merge branch 'master' into jcalderon/vlm-accuracy-eval


          add json format to queries

dae5065


          [Automated Commit] Format Codebase

c840dd6

wangshangsam suggested changes

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/task.py (outdated, resolved)

johncalesp and others added 12 commits

November 26, 2025 17:25


          added schema file and made necessary changes

0b45001


          [Automated Commit] Format Codebase

5f1d02c


          refactoring and linting

1849d6c


          [Automated Commit] Format Codebase

eef83eb


          Add Dockerfile

dafa7f1


          Add use_guided_decoding to let user choose to use guided_decoding or not.

ee91e7f


          [Automated Commit] Format Codebase

b9dd5ad


          add f1 scores of uniform random selection

ace336e


          [Automated Commit] Format Codebase

60f72be


          Enabling mlperf-inf-mm-vl2l benchmark vllm.

9c7b793


          Merge branch 'jcalderon/vlm-accuracy-eval' of github.com:CentML/mlperf-inference into jcalderon/vlm-accuracy-eval

443ff3d


          [Automated Commit] Format Codebase

36ab421

Contributor

hanyunfan commented Dec 2, 2025

@anandhu-eng Could you help check the CLA checker?


          Commit to trigger the GitHub Actions in inference PR

ea1e465

Contributor

anandhu-eng commented Dec 2, 2025

Hi @johncalesp @wangshangsam, could you merge this PR, or feel free to make a commit with this change? This should trigger the GitHub Action.

I'm not able to commit directly to the PR branch.

wangshangsam and others added 2 commits

December 2, 2025 12:39


          Merge pull request #6 from anandhu-eng/patch-39

93a1a3e

Commit to trigger the GitHub Actions in inference PR


          empty commit

a1e6d76

Contributor

wangshangsam commented Dec 2, 2025

It looks like pushing an empty commit worked. @hanyunfan, could you help approve and merge this PR?

hanyunfan self-requested a review

December 2, 2025 18:27

hanyunfan approved these changes

Contributor

hanyunfan left a comment

LGTM

hanyunfan merged commit fff953c into mlcommons:master

13 checks passed

github-actions bot locked and limited conversation to collaborators

Contributor

hanyunfan commented Dec 2, 2025

@wangshangsam Done
