
Conversation

@Xian-Gao

Hello.

This PR adds a new benchmark MME-SCI. MME-SCI is a comprehensive multimodal benchmark designed to evaluate the scientific reasoning capabilities of Multimodal Large Language Models (MLLMs). It addresses key limitations of existing benchmarks by focusing on multilingual adaptability, comprehensive modality coverage, and fine-grained knowledge point annotation.

arXiv: https://www.arxiv.org/abs/2508.13938

Data: https://huggingface.co/datasets/JCruan/MME-SCI

Thank you!

Collaborator

This file should be moved into your task folder.

Comment on lines 2 to 6
dataset_path: "parquet"
dataset_kwargs:
  data_dir: "~/.cache/huggingface/datasets"
  data_files:
    - "~/.cache/huggingface/datasets/datasets--JCruan--MME-SCI/snapshots/local_snapshot/mmesci_1019_zh.parquet"
Collaborator

You can actually configure your Hub parquet files so that this loads via `load_dataset("JCruan/MME-SCI", split="xxx")`; you can check here for reference. Otherwise people might need to hardcode the local_snapshot path.
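For illustration, a minimal sketch of what loading could look like once the Hub repo is configured that way (the split name "zh" is an assumption based on the mmesci_1019_zh.parquet file name, not a confirmed split):

from datasets import load_dataset

# Sketch: load MME-SCI straight from the Hub instead of a hardcoded
# local snapshot path. The split name "zh" is assumed, not confirmed.
ds = load_dataset("JCruan/MME-SCI", split="zh")
print(len(ds), ds.column_names)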

Comment on lines +70 to +75
img.save(buffered, format="PNG")
img_b64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
content.append({
    "type": "image",
    "url": f"data:image/png;base64,{img_b64}"
})
Collaborator

Does this work in most cases when using the chat model? I am not sure our protocol handles the base64 format correctly haha. When I designed it, I was expecting the url value to be a Pillow image.
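For comparison, a minimal sketch of the alternative the reviewer describes, passing the Pillow image object itself as the url value (whether the protocol accepts this is the reviewer's expectation, not confirmed here):

from PIL import Image

# Placeholder image standing in for the dataset image in the diff above.
img = Image.new("RGB", (64, 64))

content = []
# The "url" field holds the PIL.Image.Image object directly,
# with no PNG/base64 round trip.
content.append({
    "type": "image",
    "url": img,
})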

Collaborator

Can you put the two run scripts into your task folder or in the examples folder? Thanks!

Collaborator

Looks like the run-judge SGLang script is mostly hardcoded? The model path, mem fraction, input/output files, etc. I wonder if this can be further improved. I saw that you are using the SGLangLauncher from lmms-eval; is it possible to integrate this into utils.py?
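One possible direction, sketched below with purely hypothetical flag names and defaults: surface the hardcoded values as CLI arguments so the judge script becomes reusable.

import argparse

# Hedged sketch only; every flag name and default here is hypothetical,
# chosen to mirror the values the reviewer says are hardcoded.
parser = argparse.ArgumentParser(description="Run SGLang judge")
parser.add_argument("--model_path", required=True, help="Judge model path")
parser.add_argument("--mem_fraction_static", type=float, default=0.8,
                    help="SGLang memory fraction")
parser.add_argument("--input_file", required=True, help="Predictions to score")
parser.add_argument("--output_file", required=True, help="Where to write scores")
args = parser.parse_args()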

@kcz358
Collaborator

kcz358 commented Nov 3, 2025

Hi, most of it LGTM; just wondering if you can merge the SGLang launcher using the launcher args and put the scoring logic into utils.py. Thanks!

parser.add_argument(
    "--launcher_args",
    default=None,
    help="String arguments for launcher for local llm as judge, e.g. `tp=8`, if None then no launcher will be used.",
)
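A minimal sketch of how that string could be parsed into launcher keyword arguments; parse_launcher_args is a hypothetical helper, not an existing lmms-eval API:

def parse_launcher_args(launcher_args):
    """Parse a string like "tp=8,mem_fraction_static=0.8" into a kwargs dict.

    Hypothetical helper: int and float values are cast, everything else
    stays a string.
    """
    if not launcher_args:
        return {}
    kwargs = {}
    for pair in launcher_args.split(","):
        key, _, raw = pair.partition("=")
        value = raw.strip()
        for cast in (int, float):
            try:
                value = cast(value)
                break
            except ValueError:
                pass
        kwargs[key.strip()] = value
    return kwargs

# Example: parse_launcher_args("tp=8") -> {"tp": 8}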
