# MultiAgentLLM: a faithful recreation of the "Small LLMs Are Weak Tool Learners: A Multi-LLM Agent" research paper #681
## Related issues

### #628: LLaVA/README.md at main · haotian-liu/LLaVA
<details><summary>### Details</summary>Similarity score: 0.89
- [ ] [LLaVA/README.md at main · haotian-liu/LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/README.md?plain=1)

# 🌋 LLaVA: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

📢 LLaVA-NeXT: Blog | Project Page | Demo | Data | Model Zoo

🤝 Community Contributions: llama.cpp | Colab | 🤗 Space | Replicate | AutoGen | BakLLaVA

- Improved Baselines with Visual Instruction Tuning [Paper] [HF]
- Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper] [HF]

**Usage and License Notices:** This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

#### Suggested labels
</details>

### #333: Paper Digest: NeurIPS-2023 Highlights (Full List)
<details><summary>### Details</summary>Similarity score: 0.89
- [ ] [Paper Digest: NeurIPS-2023 Highlights (Full List)](https://www.paperdigest.org/data/neurips-2023-full.html)

**Paper Digest: NeurIPS 2023 Highlights**

1. Toolformer: Language Models Can Teach Themselves to Use Tools
2. Self-Refine: Iterative Refinement with Self-Feedback
3. Vicuna Evaluation: Exploring LLM-as-a-Judge and Chatbot Arena

#### Suggested labels
#### { "key": "LLM-Applications", "value": "Topics related to practical applications of Large Language Models in various fields" }
</details>

### #317: Streaming-llm: Efficient Streaming Language Models with Attention Sinks
<details><summary>### Details</summary>Similarity score: 0.89
- [ ] [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm)

# Efficient Streaming Language Models with Attention Sinks
[[paper](http://arxiv.org/abs/2309.17453)] [[slides](assets/StreamingLLM.pdf)] [[video](https://youtu.be/hvJsEzP34o8)]

**TL;DR:** We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.

**Abstract**

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup.

**Usage**

**Environment Setup**

```
conda create -yn streaming python=3.8
conda activate streaming

pip install torch torchvision torchaudio
pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece

python setup.py develop
```

**Run Streaming Llama Chatbot**

```
CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming
```

**TODOs**

We will release the code and data in the following order, please stay tuned!

**Citation**

If you find StreamingLLM useful or relevant to your project and research, please kindly cite our paper:

```
@article{xiao2023streamingllm,
  title={Efficient Streaming Language Models with Attention Sinks},
  author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike},
  journal={arXiv},
  year={2023}
}
```
</details>
### #332: streaming-llm: Efficient Streaming Language Models with Attention Sinks
<details><summary>### Details</summary>Similarity score: 0.88
> **Note: Efficient Streaming Language Models with Attention Sinks**
>
> [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm)
>
> **TL;DR**
>
> We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.
>
> **News**
>
> - [2023/10] StreamingLLM is integrated into Intel Extension for Transformers.
> - [2023/10] Check out Attention Sinks, a third-party implementation to enable StreamingLLM on more Huggingface LLMs.
>
> **Abstract**
>
> Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup.
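>
> A minimal sketch of the eviction idea described in the abstract (illustrative only, not code from the streaming-llm repository; the function and parameter names are assumptions):
>
> ```python
> # Keep the first few "attention sink" tokens plus a recent window; evict the middle of the KV cache.
> def kept_positions(cache_len: int, n_sink: int = 4, window: int = 4092) -> list:
>     if cache_len <= n_sink + window:
>         return list(range(cache_len))                       # cache still fits, nothing to evict
>     sinks = list(range(n_sink))                             # initial tokens acting as attention sinks
>     recent = list(range(cache_len - window, cache_len))     # most recent tokens
>     return sinks + recent
>
> # Example: a 4096-position budget keeps 4 sink positions plus the 4092 most recent ones.
> print(len(kept_positions(10_000)))  # -> 4096
> ```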
>
> **Usage**
>
> **Environment Setup**
>
> ```
> conda create -yn streaming python=3.8
> conda activate streaming
>
> pip install torch torchvision torchaudio
> pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
>
> python setup.py develop
> ```
>
> **Run Streaming Llama Chatbot**
>
> ```
> CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming
> ```
>
> **FAQ**
>
> **What does "working on infinite-length inputs" imply for LLMs?**
>
> Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset — a capability not seen in earlier methods.
>
> **Is the context window of LLMs expanded?**
>
> No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096.
>
> **Can I input an extensive text, like a book, into StreamingLLM for summarization?**
>
> While you can input a lengthy text, the model will only recognize the latest tokens.</details>
### #657: Finetuning LLMs for ReAct. Unleashing the power of finetuning to… | by Pranav Jadhav | Feb, 2024 | Towards AI
<details><summary>### Details</summary>Similarity score: 0.88
- [ ] [Finetuning LLMs for ReAct. Unleashing the power of finetuning to… | by Pranav Jadhav | Feb, 2024 | Towards AI](https://pub.towardsai.net/finetuning-llms-for-react-9ab291d84ddc)
# Finetuning LLMs for ReAct
**Description:**
Finetuning LLMs for ReAct Unleashing the power of finetuning to improve multi-hop question-answering ability in LLMs.
**Author:** Pranav Jadhav
**Published in:** Towards AI
**Reading Time:** 14 min read
**Published:** 6 days ago
**Views:** 71
![Image](https://unsplash.com/photos/XXXXX)
In this article, I will share my findings in benchmarking and finetuning open-source language models for ReAct (Reasoning + Acting). I demonstrate that finetuning can dramatically improve the accuracy of LLMs in answering multi-hop questions using ReAct. I also present a new dataset that can be used to finetune models for the ReAct format presented by the original paper (Yao et al., 2022). My findings indicate that, through finetuning, open-source LLMs show promise for making agents that can effectively reason and use tools.
**Language Models Reasoning?**
Since ChatGPT started the language model gold rush, we’ve been consistently surprised by the abilities of these neural networks to imitate our speech and writing. However, a key component of intelligence that distanced these models from ourselves was reasoning. The reasoning barrier first faltered when chain-of-thought (CoT) prompting was introduced by Wei et al. in 2022. They found that simply prompting the language model to “think step by step” and output intermediate reasoning steps improved accuracy on question-answering tasks. However, the reasoning ability of LLMs didn’t end there. Another development in reasoning was chain-of-thought with self-consistency (CoT-SC), where multiple reasoning traces are generated and the majority answer is returned as the final answer (Wang et al., 2022). Then in late 2022, a team of researchers from Princeton University and Google Research published a paper called ReAct: Synergizing Reasoning and Acting in Language Models. In this paper, the team introduces a method of prompting LLMs to output a sequence of thought, action, and observation steps to reach a final answer.
**What is ReAct?**
Simply put, ReAct is a prompting strategy to force an LLM to “reason” about what it is doing and interact with tools using actions. I will give a basic explanation here, but for a deep dive, I recommend looking at the blog post or the paper.
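To make the strategy concrete, here is a minimal, illustrative sketch of a ReAct-style loop (not code from the article; `call_llm` and `run_tool` are hypothetical stand-ins for a model call and a tool executor):

```python
# Hypothetical ReAct loop: the model alternates Thought/Action steps, and tool results
# are fed back as Observations until it emits a final answer.
def react_agent(question, call_llm, run_tool, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)            # e.g. "Thought: ...\nAction: search[capital of France]"
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            action = step.split("Action:", 1)[1].strip()
            observation = run_tool(action)     # execute the requested tool call
            transcript += f"Observation: {observation}\n"
    return "No final answer within the step budget."
```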
[Read More](https://pub.towardsai.net/finetuning-llms-for-react-9ab291d84ddc)
#### Suggested labels
#### {'label-name': 'ReAct-Prompting', 'label-description': 'Describes the method of prompting LLMs to output a sequence of thought, action, and observation steps to reach a final answer', 'gh-repo': 'https://pub.towardsai.net/finetuning-llms-for-react-9ab291d84ddc', 'confidence': 63.39}</details>
### #494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models
<details><summary>### Details</summary>Similarity score: 0.88
- [ ] [horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration)
# Awesome-Efficient-LLM
A curated list for [Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM):
- [Knowledge Distillation](#knowledge-distillation)
- [Network Pruning](#network-pruning)
- [Quantization](#quantization)
- [Inference Acceleration](#inference-acceleration)
- [Efficient MOE](#efficient-moe)
- [Text Compression](#text-compression)
- [Low-Rank Decomposition](#low-rank-decomposition)
- [Hardware/System Tuning](#hardwareSystem-tuning)
- [Survey](#survey)
- [Leaderboard](#leaderboard)
- [🚀 Updates](#updates)
- [Contributing](#contributing)
---
## Inference Acceleration
- …
- [Add your paper here](https://github.com/horseee/Awesome-Efficient-LLM/blob/main/generate_item.py), [generate the required format](https://github.com/horseee/Awesome-Efficient-LLM#decontributing), and submit a pull request.
---
## Updates
- **Sep 27, 2023:** Add tag for papers accepted at NeurIPS'23.
- **Sep 6, 2023:** Add a new subdirectory `project/` to organize those projects designed for developing a lightweight LLM.
- **July 11, 2023:** Create a new subdirectory `efficient_plm/` for papers applicable to PLMs (such as BERT, BART) but have yet to be verified for their effectiveness on LLMs.
---
## Contributing
If you'd like to include your paper or need to update any details, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in `generate_item.py` and executing `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me the links to your paper and code, and I will add your paper to the list at my earliest convenience.
- URL: [https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration)
#### Suggested labels
#### { "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }</details>
## RichardAragon/MultiAgentLLM

**DESCRIPTION:** "Multi Agent Language Learning Machine (Multi Agent LLM)
**(Update) 1/20/2024:** Global fine-tuned Phi model located here: PhiGlobalFineTunedAgent

I can complete the second fine-tunes for the individual agent actors (Planner, Caller, Observer), but I cannot complete the fine-tuning process and upload the completed models to HuggingFace due to compute limits. I need more GPUs :(

**(Update) 1/19/2024:** All datasets available at the HuggingFace repo: TuringsSolutions
### Introduction

Welcome to the official repository for the Multi Agent LLM, a faithful recreation of the framework from the research paper "Small LLMs Are Weak Tool Learners: A Multi-LLM Agent", under the MIT Open Source License. Our aim is to fine-tune distinct Planner, Caller, and Summarizer agents capable of completing complex tasks efficiently. This will be accomplished using three TinyLlama models. Datasets from the research paper and from the Gorilla release will be used to create further synthetic datasets to train the three distinct models.
### Getting Started

**Prerequisites**

requirements.txt

**Installation**
### Multi Agent LLM Methodology Overview
Our project introduces the Multi Agent LLM framework, built around Large Language Models (LLMs) to handle complex tasks involving tool usage and decision-making processes. This framework draws inspiration from the ReAct framework (Yao et al., 2022) and is aimed at addressing challenges faced by single-LLM solutions for tool-learning tasks.

Three main modules constitute the Multi Agent LLM architecture: the Planner ($M_{plan}$), the Caller ($M_{call}$), and the Summarizer ($M_{sum}$). By dividing labor amongst the agents and dedicating a specific LLM to each sub-task, we enhance the overall effectiveness of tackling complex problems involving task planning, tool selection, and result summarization.
The α-UMi framework workflow commences when the user inputs a query ($q$). The Planner module creates a rationale ($r_t$) guiding the upcoming step. According to $r_t$, the procedure either advances, with the Caller triggered to interact with the tools and collect observations, or, after enough information has been gathered, the Planner shifts control over to the Summarizer module, which forms the final response for the user. In contrast, if the instruction remains unresolved, the system abandons the attempt.
Each module plays a distinctive role:
- **Planner:** Acting as the brain, the Planner receives the user instruction ($q$), the prior execution trajectory ($\tau_{t-1}$), and its system prompt ($P_{plan}$), and outputs a rationale ($r_t$): $$r_t = M_{plan}(P_{plan}, \tau_{t-1}, q)$$
- **Caller:** Trained to concentrate solely on producing proper tool-interaction commands, the Caller accepts the user instruction and the preceding execution trajectory ($\tau_{t-1}$). With guidance from $r_t$, the Caller delivers the requested action ($a_t$): $$a_t = M_{call}(P_{call}, \tau_{t-1}, q, r_t)$$
- **Summarizer:** Dedicated to crafting the final meaningful response, the Summarizer obtains the user instruction ($q$), the former execution trajectory ($\tau_{t-1}$), its system prompt ($P_{sum}$), and the final rationale ($r_t$), delivering the final answer ($a$): $$a = M_{sum}(P_{sum}, \tau_{t-1}, q, r_t)$$
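As a rough illustration of how these three formulas compose at inference time, here is a minimal control-loop sketch (the `planner`, `caller`, `summarizer`, and `tools` callables are hypothetical placeholders, not code from this repository):

```python
# Sketch of the Planner -> Caller -> Summarizer loop defined by the formulas above.
def multi_agent_run(q, planner, caller, summarizer, tools, max_turns=8):
    tau = []                                    # execution trajectory: (rationale, action, observation) triples
    for _ in range(max_turns):
        r_t = planner(q, tau)                   # r_t = M_plan(P_plan, tau_{t-1}, q)
        if r_t.endswith("Next: Summarizer"):
            return summarizer(q, tau, r_t)      # a = M_sum(P_sum, tau_{t-1}, q, r_t)
        if r_t.endswith("Next: Caller"):
            a_t = caller(q, tau, r_t)           # a_t = M_call(P_call, tau_{t-1}, q, r_t)
            observation = tools(a_t)            # execute the tool call and record the result
            tau.append((r_t, a_t, observation))
        else:
            break                               # the planner decided to give up
    return "Instruction could not be resolved."
```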
### Global-to-Local Progressive Fine-Tuning Strategy

We propose a novel two-stage fine-tuning technique, Global-to-Local Progressive Fine-Tuning (GLPFT), applied to the α-UMi framework modules. GLPFT ensures effective fine-tuning adapted to each specific role. Initially, a shared base LLM undergoes global fine-tuning on large generic datasets. Following this, specialization occurs during local fine-tuning on subsets aligned with the roles and duties assumed by the dedicated modules. Additional details concerning data organization and prompt adaptation appear in Appendix A.
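A compact sketch of that two-stage flow is given below; the `finetune` routine and the data shapes are placeholders for whatever supervised fine-tuning setup is actually used (e.g. LoRA on TinyLlama), not code taken from the repository's scripts:

```python
# GLPFT sketch: one global fine-tune on full trajectories, then one local fine-tune per role.
def glpft(base_model, global_data, role_data, finetune):
    """finetune(model, data) -> new_model is any supervised fine-tuning routine."""
    backbone = finetune(base_model, global_data)        # stage 1: shared backbone on generic trajectories
    return {role: finetune(backbone, data)              # stage 2: specialize Planner / Caller / Summarizer
            for role, data in role_data.items()}
```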
Here are the steps to create the global fine-tuning dataset for training the backbone LLM:
So in summary - keep full trajectories together as one long target text, have multiple variants per user instruction, and do not differentiate between sub-tasks at this stage. Let me know if you need any other details!
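Under that summary, a single global-stage row might look roughly like the following (the field names and the trajectory text are illustrative assumptions, not rows from the released datasets):

```python
# Illustrative global-stage row: the entire trajectory is one long target sequence,
# with no separation between planner, caller, and summarizer sub-tasks yet.
global_row = {
    "instruction": "Book me a flight from Boston to Denver next Friday.",
    "target": (
        "Rationale: I should look up available flights first. Next: Caller\n"
        "Action: search_flights(origin='BOS', destination='DEN', date='next Friday')\n"
        "Observation: 3 flights found, cheapest departs 9:15 AM.\n"
        "Rationale: I have enough information to answer. Next: Summarizer\n"
        "Answer: The cheapest flight departs Boston at 9:15 AM on Friday."
    ),
}
# Several paraphrased variants of the same instruction can be kept as separate rows.
```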
Here is a methodology you can follow to create the training data needed for the Planner agent, based on the approach outlined in the research paper:
Here is an example row from the global fine-tuning dataset:
So in this example, the full trajectory from the initial user query to the final flight booking is kept intact as one sequence, which the model will be trained to generate end-to-end. Here is an example training set row for fine-tuning the Planner specifically:
In this example, only the rationales from the Planner are kept as the target text. The actions, observations, and answers are removed. The format is updated so that "Next: Caller/Summarizer" is appended to each rationale, and the prompt provides some context about the user's flight booking request.
This structures the data specifically for fine-tuning the Planner's ability to generate helpful rationales and decide the next high-level step.
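Following that description, a Planner-specific row could be shaped roughly as below (again an illustrative assumption rather than the actual dataset format):

```python
# Illustrative Planner-stage row: only the rationale survives as the target,
# ending with the "Next: Caller/Summarizer" decision the Planner must learn to make.
planner_row = {
    "prompt": (
        "User instruction: Book me a flight from Boston to Denver next Friday.\n"
        "Previous trajectory: search_flights(...) -> Observation: 3 flights found, cheapest departs 9:15 AM.\n"
        "What should the agent do next?"
    ),
    "target": "Rationale: I have enough information to answer. Next: Summarizer",
}
```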
### References

This project is distributed under the terms of the MIT Open Source License. Contributions are welcome! Please feel free to submit Pull Requests and report issues. Refer to CONTRIBUTING for guidelines."
#### Suggested labels