Raw scores? #10

Open
WesleyYue opened this issue May 14, 2024 · 3 comments

Comments

@WesleyYue

Hey authors, really nice work!

The paper shows scores averaged across the tasks in each category. Is the full set of per-task scores for each model available anywhere? In particular, for Gemini only the final averaged score is available on GitHub.

Also, any plans to test beyond 128k for Gemini? Given that the test doesn't saturate at 128k for Gemini, it seems important.

@hsiehjackson
Collaborator

Hi @WesleyYue, here are the Gemini scores for each task category. For other models, you can find the results in the appendix of our paper.

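| Category | #Tasks | 4K | 8K | 16K | 32K | 64K | 128K | 256K |
|---|---|---|---|---|---|---|---|---|
| NIAH | 8 | 99.8 | 99.9 | 99.6 | 99.7 | 99.7 | 99.6 | 98.7 |
| VT | 1 | 100 | 100 | 100 | 100 | 99.6 | 100 | 100.0 |
| AG | 2 | 97.7 | 97.7 | 97.6 | 98.6 | 97.3 | 90.9 | 95.9 |
| QA | 2 | 81.9 | 75.9 | 77.8 | 75.9 | 77.6 | 74.1 | 74.2 |
| Total | 13 | 96.7 | 95.8 | 96.0 | 95.9 | 95.9 | 94.4 | 94.6 |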
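
For anyone cross-checking these numbers: the Total row appears to be the mean of the category averages weighted by the number of tasks in each category. That is inferred from the figures in the table, not something stated in this thread. A minimal sketch that recomputes it follows; the names `CATEGORIES`, `REPORTED_TOTAL`, and `weighted_totals` are made up for illustration and do not come from the repository.

```python
# Category averages copied from the table above:
# category -> (number of tasks, average score at each context length).
LENGTHS = ["4K", "8K", "16K", "32K", "64K", "128K", "256K"]
CATEGORIES = {
    "NIAH": (8, [99.8, 99.9, 99.6, 99.7, 99.7, 99.6, 98.7]),
    "VT": (1, [100, 100, 100, 100, 99.6, 100, 100.0]),
    "AG": (2, [97.7, 97.7, 97.6, 98.6, 97.3, 90.9, 95.9]),
    "QA": (2, [81.9, 75.9, 77.8, 75.9, 77.6, 74.1, 74.2]),
}
REPORTED_TOTAL = [96.7, 95.8, 96.0, 95.9, 95.9, 94.4, 94.6]


def weighted_totals(categories):
    """Task-count-weighted mean of the category averages, per context length.

    Assumption: this is how the Total row was produced; the thread does not say so.
    """
    n_tasks = sum(n for n, _ in categories.values())
    n_lengths = len(next(iter(categories.values()))[1])
    return [
        sum(n * scores[i] for n, scores in categories.values()) / n_tasks
        for i in range(n_lengths)
    ]


if __name__ == "__main__":
    totals = weighted_totals(CATEGORIES)
    for length, got, reported in zip(LENGTHS, totals, REPORTED_TOTAL):
        print(f"{length:>4}: recomputed {got:.2f}, reported {reported}")
```

The recomputed values agree with the reported Total row to within 0.1 at every length, which is about what rounding of the category averages would produce. The reverse is not possible: the individual task scores cannot be recovered from these category averages.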

We have also tested beyond 128K, but we received a lot of API errors for lengths > 256K.

@WesleyYue
Author

Thank you! I was referring to the individual task scores. The paper seems to show only averages across each category (for example, NIAH is an average of 8 task scores, Aggregation is an average of 2, etc.).

The API erroring out beyond 256k is actually a pretty interesting data point. I thought >128k might have been skipped due to cost constraints.

@dorzim

dorzim commented Aug 20, 2024

Hi @hsiehjackson,
I was trying to reproduce the results of Gemini 1.5 Pro at 128K, and the results I got were extremely different.
I suspect that the model has changed, and would appreciate it if you could rerun the tests on this model or share the branch or local code, if it exists.
Thanks!
