1‐Overview
How to evaluate the abilities of large language models (LLMs) remains an open question now that ChatGPT-like LLMs have become prevalent in the community. Existing evaluation methods suffer from the following shortcomings:
(1) constrained evaluation abilities, (2) vulnerable benchmarks, and (3) unobjective metrics.
We argue that task-based evaluation, in which LLM agents complete tasks in a simulated environment, is a one-for-all solution to the above problems.
We present AgentSims, an easy-to-use infrastructure for researchers from all disciplines to test the specific capacities they are interested in.
Researchers can build their evaluation tasks by adding agents and buildings on an interactive GUI, or deploy and test new support mechanisms, such as memory and planning systems, with a few lines of code (see the sketch below).
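
As a rough illustration of what plugging in a custom support mechanism could look like, the sketch below defines a minimal memory system behind a small interface. The `MemorySystem` base class, the `KeywordMemory` subclass, and the method names (`store`, `retrieve`) are hypothetical placeholders, not AgentSims' actual API; consult the repository code for the real extension points.

```python
# Hypothetical sketch of a pluggable memory system; the class and method
# names are illustrative placeholders, not AgentSims' actual API.
from abc import ABC, abstractmethod


class MemorySystem(ABC):
    """Interface an agent could use to store and recall observations."""

    @abstractmethod
    def store(self, observation: str) -> None:
        ...

    @abstractmethod
    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        ...


class KeywordMemory(MemorySystem):
    """Toy implementation: rank stored observations by keyword overlap."""

    def __init__(self) -> None:
        self.observations: list[str] = []

    def store(self, observation: str) -> None:
        self.observations.append(observation)

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        query_words = set(query.lower().split())
        scored = sorted(
            self.observations,
            key=lambda obs: len(query_words & set(obs.lower().split())),
            reverse=True,
        )
        return scored[:top_k]


if __name__ == "__main__":
    memory = KeywordMemory()
    memory.store("The agent bought coffee at the cafe.")
    memory.store("The agent talked to the mayor about taxes.")
    print(memory.retrieve("coffee cafe"))
```

Swapping in a different retrieval strategy (e.g., embedding-based similarity instead of keyword overlap) would only require replacing the subclass, which is the kind of lightweight experimentation the sandbox is meant to support.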
We present a demonstration of our system at https://agentsims.com/. Because the live demo is limited, we provide the full version of our sandbox in this repository. Users can follow our deployment instructions to set up the tasks they need to evaluate in our sandbox.