@johncalesp

This PR serves as a proposal to evaluate the accuracy of the VLM.
Notes

  • Added min.query.count to the Task. As of now, LoadGen sends a number of requests equal to the total number of records in the dataset, so we need a knob to control how many requests are sent.
  • Added the class Evaluator to run the evaluation (the method calculate_exact_match was left in case we need it during development, but it may be deleted later).
  • Added extra dependencies.

As of now, running 1k samples, I get:

╒═══════════════╤════════════╕
│ Fields        │   F1 Score │
╞═══════════════╪════════════╡
│ category      │   0.777359 │
├───────────────┼────────────┤
│ is_secondhand │   0.105263 │
╘═══════════════╧════════════╛
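
For readers who want to see how a per-field F1 table like the one above could be produced, here is a minimal sketch. It is not the Evaluator code from this PR: the averaging scheme (macro-averaged F1 over labels), the helper names `macro_f1` and `f1_per_field`, and the record layout (one dict per sample, keyed by field name) are all assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over every label seen in y_true or y_pred.

    Assumption: the PR does not state which F1 variant it reports;
    macro averaging is used here purely as an illustration.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def f1_per_field(truth_records, pred_records, fields):
    """Compute one F1 score per field (hypothetical record layout)."""
    return {
        field: macro_f1(
            [record[field] for record in truth_records],
            [record[field] for record in pred_records],
        )
        for field in fields
    }


if __name__ == "__main__":
    # Toy data, not taken from the benchmark dataset.
    truth = [
        {"category": "shoes", "is_secondhand": True},
        {"category": "shoes", "is_secondhand": False},
        {"category": "bags", "is_secondhand": False},
    ]
    preds = [
        {"category": "shoes", "is_secondhand": False},
        {"category": "bags", "is_secondhand": False},
        {"category": "bags", "is_secondhand": False},
    ]
    for field, score in f1_per_field(truth, preds, ["category", "is_secondhand"]).items():
        print(f"{field}: {score:.6f}")
```

With macro averaging, a label the model never predicts correctly contributes an F1 of zero, so a field with a rare label can score low even when most predictions match.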

@johncalesp johncalesp requested a review from a team as a code owner November 19, 2025 16:23
@github-actions

github-actions bot commented Nov 19, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@wangshangsam

As it stands, is_secondhand seems quite low. Did you get this score after including the product description or before? Which model did you use for this?

@johncalesp

The results were after including product description, and the model that I'm using is Qwen/Qwen2.5-VL-32B-Instruct

@wangshangsam

> The results were after including product description, and the model that I'm using is Qwen/Qwen2.5-VL-32B-Instruct

We should try the actual Qwen/Qwen3-VL-235B-A22B-Instruct and Qwen/Qwen3-VL-235B-A22B-Thinking, but that can wait until after this PR is merged.

@wangshangsam wangshangsam left a comment

LGTM! Thanks a lot, John!

@wangshangsam

This PR seems ready. @hanyunfan @mrmhodak @arjunsuresh I'm wondering if you could take a look, approve and merge it? Thanks!

mrmhodak previously approved these changes Nov 25, 2025
@wangshangsam

@arjunsuresh Could you merge this PR? Thanks a lot!

@hanyunfan

hanyunfan commented Dec 2, 2025

@anandhu-eng Could you help check the CLA checker?

@anandhu-eng

anandhu-eng commented Dec 2, 2025

Hi @johncalesp @wangshangsam, could you merge this PR, or feel free to make a commit with this change? This should trigger the GitHub action.

I'm not able to commit directly to the PR branch.

wangshangsam and others added 2 commits December 2, 2025 12:39
Commit to trigger the GitHub Actions in inference PR
@wangshangsam

It looks like pushing an empty commit works. @hanyunfan could you help approve and merge this PR?

@hanyunfan hanyunfan self-requested a review December 2, 2025 18:27

@hanyunfan hanyunfan left a comment

LGTM

@hanyunfan hanyunfan merged commit fff953c into mlcommons:master Dec 2, 2025
13 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 2, 2025
@hanyunfan

@wangshangsam Done
