The leaderboard is available here: Clem Leaderboard
See CHANGELOG
The list of supported open & closed/commercial models can be found here: model registry
Each model has a separate folder containing its results for each game. The outputs are organised as follows: /model/game/experiment. Each episode within an experiment includes the following files:
- instance.json: information about the episode, including the prompt text
- interactions.json: the interactions between the players and the game master
- requests.json: the inputs given to the tested model and the outputs it generated
- scores.json: the scores computed at the episode and turn level
- transcript.html: transcript of the dialogue in HTML
- transcript.tex: transcript of the dialogue in LaTeX
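The layout above can be traversed with a few lines of Python. This is a minimal sketch, not part of the benchmark code: the helper names `iter_episode_files` and `load_json` are hypothetical, and it assumes that each episode is a subdirectory directly under its experiment folder. The internal structure of the JSON files is not assumed here; the files are simply parsed and returned.

```python
import json
from pathlib import Path


def iter_episode_files(results_root, filename="scores.json"):
    """Yield (model, game, experiment, episode, path) for each matching file.

    Assumed layout: <root>/<model>/<game>/<experiment>/<episode>/<filename>.
    The episode directory level is an assumption, not confirmed by the README.
    """
    root = Path(results_root)
    # One wildcard per directory level: model, game, experiment, episode.
    for path in sorted(root.glob(f"*/*/*/*/{filename}")):
        model, game, experiment, episode = path.relative_to(root).parts[:4]
        yield model, game, experiment, episode, path


def load_json(path):
    """Parse one of the per-episode JSON files and return its content."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical usage, with "results" standing in for the checkout directory:
# for model, game, experiment, episode, path in iter_episode_files("results"):
#     scores = load_json(path)
```

Passing `filename="interactions.json"` or `filename="requests.json"` collects the other per-episode files in the same way.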
Each run of the benchmark generates a CSV file and an HTML file covering all tested models across all games (results.csv and results.html).
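For quick inspection, results.csv can be read with Python's standard `csv` module. The sketch below makes no assumption about the column names: it returns every row as a dictionary keyed by whatever header the file contains. The function name `load_results` is hypothetical.

```python
import csv


def load_results(csv_path):
    """Return the rows of a results CSV as a list of dicts keyed by header."""
    # newline="" is the setting recommended by the csv module for file input.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Hypothetical usage:
# rows = load_results("results.csv")
# print(rows[0].keys())  # shows the actual column names in the file
```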