MoA/README.md at main · togethercomputer/MoA #884
Labels
AI-Agents
Autonomous AI agents using LLMs
AI-Chatbots
Topics related to advanced chatbot platforms integrating multiple AI models
ai-leaderboards
leaderdoards for llm's and other ml models
ai-platform
model hosts and APIs
Git-Repo
Source code repository like gitlab or gh
github
gh tools like cli, Actions, Issues, Pages
llm
Large Language Models
llm-applications
Topics related to practical applications of Large Language Models in various fields
llm-benchmarks
testing and benchmarking large language models
llm-evaluation
Evaluating Large Language Models performance and behavior through human-written evaluation sets
llm-experiments
experiments with large language models
Papers
Research papers
Mixture-of-Agents (MoA)
Overview · Quickstart · Advanced example · Interactive CLI Demo · Evaluation · Results . Credits
Overview
Mixture of Agents (MoA) is a novel approach that leverages the collective strengths of multiple LLMs to enhance performance, achieving state-of-the-art results. By employing a layered architecture where each layer comprises several LLM agents, MoA significantly outperforms GPT-4 Omni's 57.5% on AlpacaEval 2.0 with a score of 65.1%, using only open-source models!
Quickstart: MoA in 50 LOC
To get to get started with using MoA in your own apps, see
moa.py
. In this simple example, we'll use 2 layers and 4 LLMs. You'll need to:pip install together
export TOGETHER_API_KEY=
python moa.py
Multi-layer MoA Example
In the previous example, we went over how to implement MoA with 2 layers (4 LLMs answering and one LLM aggregating). However, one strength of MoA is being able to go through several layers to get an even better response. In this example, we'll go through how to run MoA with 3+ layers in
advanced-moa.py
.Interactive CLI Demo
This interactive CLI demo showcases a simple multi-turn chatbot where the final response is aggregated from various reference models.
To run the interactive demo, follow these 3 steps:
export TOGETHER_API_KEY={your_key}
pip install -r requirements.txt
python bot.py
The CLI will prompt you to input instructions interactively:
[Optional] Additional Configuration
The demo will ask you to specify certain options but if you want to do additional configuration, you can specify these parameters:
--aggregator
: The primary model used for final response generation.--reference_models
: List of models used as references.--temperature
: Controls the randomness of the response generation.--max_tokens
: Maximum number of tokens in the response.--rounds
: Number of rounds to process the input for refinement. (num rounds == num of MoA layers - 1)--num_proc
: Number of processes to run in parallel for faster execution.--multi_turn
: Boolean to toggle multi-turn interaction capability.Evaluation
We provide scripts to quickly reproduce some of the results presented in our paper
For convenience, we have included the code from AlpacaEval,
MT-Bench, and FLASK, with necessary modifications.
We extend our gratitude to these projects for creating the benchmarks.
Preparation
Run AlpacaEval 2
To run AlpacaEval 2, execute the following scripts:
Run MT-Bench
For a minimal example of MT-Bench evaluation, run:
Run FLASK
For a minimal example of FLASK evaluation, run:
Results
We achieved top positions on both the AlpacaEval 2.0 leaderboard and MT-Bench. Notably, on AlpacaEval 2.0, using solely open-source models, we achieved a margin of 7.6% absolute improvement from 57.5% (GPT-4 Omni) to 65.1% (MoA).
FLASK offers fine-grained evaluation of models across multiple dimensions. Our MoA method significantly outperforms the original Qwen1.5-110B-Chat on harmlessness, robustness, correctness, efficiency, factuality, commonsense, insightfulness, completeness. Additionally, MoA also outperforms GPT-4 Omni in terms of correctness, factuality, insightfulness, completeness, and metacognition.
Please feel free to contact us if you have difficulties in reproducing the results.
Credits
Notably, this work was made possible by the collaborative spirit and contributions of active organizations in the AI field. We appreciate the efforts of Meta AI, Mistral AI, Microsoft, Alibaba Cloud, and DataBricks for developing the Llama 3, Mixtral, WizardLM 2, Qwen 1.5, and DBRX models. Additionally, we extend our gratitude to Tatsu Labs, LMSYS, and KAIST AI for developing the AlpacaEval, MT-Bench, and FLASK evaluation benchmarks.
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Citation
If you find this work helpful, please consider citing:
Suggested labels
None
The text was updated successfully, but these errors were encountered: