Raw scores? #10

WesleyYue · 2024-05-14T14:42:09Z

Hey authors, really nice work!

The paper shows scores averaged across tasks for each test. Is the full set of per-task scores for each model available anywhere? In particular, for Gemini, only the final averaged score is available on GitHub.

Also, any plans to test beyond 128k for Gemini? Given that the test doesn't saturate at 128k for Gemini, it seems important.

hsiehjackson · 2024-05-21T01:23:12Z

Hi @WesleyYue, here are the Gemini scores for each task category. For other models, you can find the results in the appendix of our paper.

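| Category | #Tasks | 4K | 8K | 16K | 32K | 64K | 128K | 256K |
|---|---|---|---|---|---|---|---|---|
| NIAH | 8 | 99.8 | 99.9 | 99.6 | 99.7 | 99.7 | 99.6 | 98.7 |
| VT | 1 | 100 | 100 | 100 | 100 | 99.6 | 100 | 100.0 |
| AG | 2 | 97.7 | 97.7 | 97.6 | 98.6 | 97.3 | 90.9 | 95.9 |
| QA | 2 | 81.9 | 75.9 | 77.8 | 75.9 | 77.6 | 74.1 | 74.2 |
| Total | 13 | 96.7 | 95.8 | 96.0 | 95.9 | 95.9 | 94.4 | 94.6 |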

We have also tested beyond 128K, but we received a lot of API errors for lengths > 256K.
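
For readers checking how the rows in the table above relate: the Total row looks consistent with a mean of the category rows weighted by #Tasks. That is an inference from the numbers, not something stated in this thread. A minimal sketch of the check, with the values copied from the table and all variable and function names purely illustrative:

```python
# Category rows copied from the table: (number of tasks, scores per context length).
lengths = ["4K", "8K", "16K", "32K", "64K", "128K", "256K"]
categories = {
    "NIAH": (8, [99.8, 99.9, 99.6, 99.7, 99.7, 99.6, 98.7]),
    "VT":   (1, [100, 100, 100, 100, 99.6, 100, 100.0]),
    "AG":   (2, [97.7, 97.7, 97.6, 98.6, 97.3, 90.9, 95.9]),
    "QA":   (2, [81.9, 75.9, 77.8, 75.9, 77.6, 74.1, 74.2]),
}
reported_total = [96.7, 95.8, 96.0, 95.9, 95.9, 94.4, 94.6]


def weighted_totals(cats):
    """Average the category scores, weighting each category by its task count.

    Assumption: this is how the Total row was derived; the thread does not say so.
    """
    n_tasks = sum(n for n, _ in cats.values())
    n_cols = len(next(iter(cats.values()))[1])
    return [
        sum(n * scores[i] for n, scores in cats.values()) / n_tasks
        for i in range(n_cols)
    ]


totals = weighted_totals(categories)
for name, got, want in zip(lengths, totals, reported_total):
    print(f"{name}: recomputed {got:.2f}, reported {want}")
```

The recomputed values land within about 0.1 of the reported Total row. The small gaps are what one would expect if the totals were computed from unrounded per-task scores rather than from the rounded category means shown here.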

WesleyYue · 2024-05-21T04:18:55Z

Thank you! I was referring to the individual task scores. The paper seems to show only per-category averages (for example, NIAH is an average of 8 scores, Aggregation is an average of 2, etc.).

The API erroring out beyond 256k is actually a pretty interesting data point. I thought >128k might have been skipped due to cost constraints.

dorzim · 2024-08-20T11:21:08Z

Hi @hsiehjackson,
I was trying to reproduce the Gemini 1.5 Pro results at 128k, and the results I got were very different.
I suspect the model has changed, and would appreciate it if you could rerun the tests on this model or share the branch or local code, if it exists.
Thanks!
