
how to evaluate model outputs on testset #2

Open
AdonLee072348 opened this issue Feb 8, 2024 · 5 comments

@AdonLee072348

No description provided.

@CMMMU-Benchmark (Owner)

Hi,

Our evaluation server for the test set is now available on EvalAI. We welcome all submissions and look forward to your participation!
Thank you for your attention.

Best wishes!

@AdonLee072348 (Author)

Thanks for your reply! We have evaluated our model Marco-VL-Plus on EvalAI; the accuracies on the val and test sets are 43.44 and 40.69, respectively. Would you please consider listing our result in your repo?

val result:
| Subject | Correct Num | Entries Num | Acc |
|--------------------------------|---------------|---------------|----------|
| art_and_design | 58 | 88 | 0.659091 |
| business | 31 | 126 | 0.246032 |
| science | 71 | 204 | 0.348039 |
| health_and_medicine | 83 | 153 | 0.542484 |
| humanities_and_social_sciences | 46 | 85 | 0.541176 |
| technology_and_engineering | 102 | 244 | 0.418033 |
| all | 391 | 900 | 0.434444 |

test result:
{'Art & Design': 66.72777268560954, 'Business': 21.911573472041614, 'Science': 36.80834001603849, 'Health & Medicine': 46.27345844504022, 'Humanities & Social Sciences': 47.10982658959538, 'Technology & Engineering': 38.36583725622058, 'Overall': 40.69090909090909}
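For anyone double-checking the overall val number above, it is the micro-average of the per-subject counts (total correct over total entries), not a mean of per-subject accuracies. A minimal sketch, with the counts copied from the table and the helper name my own:

```python
# Val-split counts from the table above: subject -> (correct, entries)
VAL_COUNTS = {
    "art_and_design": (58, 88),
    "business": (31, 126),
    "science": (71, 204),
    "health_and_medicine": (83, 153),
    "humanities_and_social_sciences": (46, 85),
    "technology_and_engineering": (102, 244),
}

def overall_accuracy(counts):
    """Micro-average: sum correct across subjects over sum of entries."""
    correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return correct / total

print(round(overall_accuracy(VAL_COUNTS), 6))  # 391 / 900 = 0.434444
```

This reproduces the 0.434444 "all" row (and the 43.44 headline number) from the six subject rows.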

@XinrunDu (Collaborator)


Thank you for your reply, and for your interest in the CMMMU benchmark.

Regarding your inquiry about submitting your model to the leaderboard, there are a few details we need to confirm with you:

  1. Do you have plans to open source your model, and should it be categorized under open source or private models?
  2. Is your model name displayed on the leaderboard confirmed to be Marco-VL-Plus?

Once again, we appreciate your support of and contribution to our work.
Should you have any further questions or need additional assistance, please feel free to contact us at any time.

Best,
CMMMU Team

@AdonLee072348 (Author)


Thank you for your great work on the CMMMU benchmark.
Our model is currently private, but we will release it in the future.
And yes, our model name is Marco-VL-Plus.

@shan23chen

Great project!

Would you consider providing the test answer key for a subset of the health and science partitions?
We would also love to chat about a possible collaboration!

Thanks!
Shan Chen
PhD candidate @ Harvard AIM
