How to evaluate model outputs on the test set #2
Comments
Hi, our evaluation server for the test set is now available on EvalAI. We welcome all submissions and look forward to your participation! Best wishes!
Thanks for your reply! We have evaluated our model Marco-VL-Plus on EvalAI; the accuracy on the val and test sets is 43.44 and 40.69, respectively. Would you please consider showing our results in your repo?
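For reference, an accuracy figure like the ones above is just the fraction of questions answered correctly, scored as a percentage. A minimal sketch of that computation is below; the file names and the `{"question_id": "answer"}` JSON layout are assumptions for illustration, not the actual CMMMU/EvalAI submission format, which the evaluation server defines.

```python
import json

def accuracy(pred_path: str, gold_path: str) -> float:
    """Percentage of questions whose predicted answer matches the gold answer.

    Both files are assumed to map question IDs to answer strings,
    e.g. {"q1": "A", "q2": "B"}. Comparison is case-insensitive and
    ignores surrounding whitespace.
    """
    with open(pred_path) as f:
        preds = json.load(f)
    with open(gold_path) as f:
        gold = json.load(f)
    correct = sum(
        1
        for qid, ans in gold.items()
        if str(preds.get(qid, "")).strip().lower() == str(ans).strip().lower()
    )
    return 100.0 * correct / len(gold)
```

Missing predictions simply count as wrong, which matches how a held-out evaluation server would typically score an incomplete submission.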
Thank you for your reply, and for your interest in the CMMMU benchmark. Regarding your inquiry about submitting your model to the leaderboard, there are a few details we need to confirm with you:
Once again, we appreciate your support of and contribution to our work. Best,
Thank you for your great work on the CMMMU benchmark.
Great project! We would love to know whether you can provide the test answer key for a subset of the health and science partition. Thanks!