(placeholder name)
A framework for tracking large language model (LLM) foundational models, fine-tunes, datasets, and evals.

Existing lists, leaderboards, and evals:
- lhl's LLM Worksheet
- CRFM HELM
- CRFM Ecosystem Graphs
- Open LLMs
- Viktor Garske's AI / ML / LLM / Transformer Models Timeline and List
- awesome-marketing-datascience Open LLM Models List
- HuggingFace Open LLM Leaderboard
  - warning: their MMLU results are wrong, throwing off the whole ranking: https://twitter.com/Francis_YAO_/status/1666833311279517696
- LMSys Chatbot Arena Leaderboard - Elo-style ranking (see the Elo sketch after this list)
- LLM-Leaderboard
- Gotzmann LLM Score v2 (discussion)
- Chain-of-Thought Hub
- C-Eval Leaderboard
- llm-humaneval-benchmarks - HuggingFace models evaluated against HumanEval+
- CanAiCode Leaderboard - uses the Can AI Code? eval
- AlpacaEval Leaderboard
- YearZero's LLM Logic Tests
- HELM Core Scenarios
- TextSynth Server
- airate - C++ bug catching test
- llm-jeopardy - automated quiz show answering
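For reference, Chatbot Arena-style rankings are built on pairwise Elo updates. Below is a minimal sketch of a generic Elo update step, not LMSys's actual implementation; the function name and K-factor are illustrative assumptions.

```python
# A minimal sketch of an Elo-style rating update, the general approach behind
# pairwise "arena" rankings. K-factor and starting ratings are assumptions.
def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one head-to-head comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b


# Example: two models start at 1000; model A wins one matchup.
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```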
Goal: a single open, comprehensive repository that is a superset of the existing lists above, and that allows for low-friction submissions, updates, and collaboration.
- FastAPI API for queries (see the API sketch after this list)
- Permissively licensed (CC0?) dataset available as YAML, Datasette, etc.
- Robust, extensible, historical data model (see the data model sketch after this list):
  - Entities/Organizations
  - Models (foundational, fine-tunes)
  - Versions (sizes, checkpoints, quantizations)
  - Evals (repeatable benchmarks, rankings, contributions)
- Allow submissions (rollbacks, updates) via either GitHub pull requests or a simple GitHub/HuggingFace auth workflow
- Figure out how to import and source evals
- Track submissions by date
- Custom views
- Live with CRFM (or LMSys, or another long-running org?)
- Should have a community Discord
- Be welcoming of all contributors
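As a starting point for the data model bullet above, here is a minimal sketch of the Entities/Models/Versions/Evals hierarchy using Pydantic (the validation layer FastAPI already uses). All class names, fields, and the reference-by-name convention are assumptions for illustration, not a finalized schema.

```python
# Sketch of the proposed data model: organizations release models, models have
# versions (sizes/checkpoints/quantizations), and eval results attach to versions.
from datetime import date
from typing import Optional

from pydantic import BaseModel


class Organization(BaseModel):
    """An entity that releases models (lab, company, community group)."""
    name: str
    url: Optional[str] = None


class Model(BaseModel):
    """A foundational model or fine-tune."""
    name: str
    organization: str                   # references Organization.name
    base_model: Optional[str] = None    # set for fine-tunes
    license: Optional[str] = None
    release_date: Optional[date] = None


class ModelVersion(BaseModel):
    """A concrete artifact: a size, checkpoint, or quantization of a Model."""
    model: str                          # references Model.name
    label: str                          # e.g. "7B" or "13B-q4_0"
    parameters_b: Optional[float] = None  # parameter count in billions


class EvalResult(BaseModel):
    """A repeatable benchmark score attached to a ModelVersion."""
    model: str
    version: str                        # references ModelVersion.label
    benchmark: str                      # e.g. "MMLU", "HumanEval+"
    score: float
    source_url: Optional[str] = None
    submitted_on: Optional[date] = None  # supports tracking submissions by date
```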
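And a minimal sketch of the FastAPI query layer serving a permissively licensed YAML dataset, per the API and dataset bullets above. The file name `models.yaml`, the record shape, and the endpoints are hypothetical; a real service would likely validate records against the models sketched above and add filtering by eval.

```python
# Sketch of a query API over a flat-file YAML dataset. Assumes a models.yaml
# file containing a list of model records with "name" and "organization" keys.
from pathlib import Path
from typing import Optional

import yaml                      # pip install pyyaml
from fastapi import FastAPI, HTTPException

app = FastAPI(title="LLM tracker (placeholder name)")

DATA_FILE = Path("models.yaml")  # hypothetical flat-file dataset


def load_models() -> list[dict]:
    """Read the YAML dataset; each top-level entry is one model record."""
    with DATA_FILE.open() as f:
        return yaml.safe_load(f) or []


@app.get("/models")
def list_models(organization: Optional[str] = None) -> list[dict]:
    """List all model records, optionally filtered by organization."""
    records = load_models()
    if organization:
        records = [r for r in records if r.get("organization") == organization]
    return records


@app.get("/models/{name}")
def get_model(name: str) -> dict:
    """Fetch a single model record by name, or 404 if it is unknown."""
    for record in load_models():
        if record.get("name") == name:
            return record
    raise HTTPException(status_code=404, detail="model not found")
```

Run with `uvicorn main:app --reload` and query `/models` or `/models?organization=...`; keeping the source of truth in YAML means the same file can also be published directly or loaded into Datasette.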