Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gemma llm context window extended to 90k using LongLM #661

Open
1 task
irthomasthomas opened this issue Feb 28, 2024 · 1 comment
Open
1 task

Gemma llm context window extended to 90k using LongLM #661

irthomasthomas opened this issue Feb 28, 2024 · 1 comment
Labels
AI-Chatbots Topics related to advanced chatbot platforms integrating multiple AI models ai-leaderboards leaderdoards for llm's and other ml models Algorithms Sorting, Learning or Classifying. All algorithms go here. llm Large Language Models New-Label Choose this option if the existing labels are insufficient to describe the content accurately Papers Research papers Research personal research notes for a topic

Comments

@irthomasthomas
Copy link
Owner

HongyeJ on X:

"Despite the mixed feelings about Google's latest Gemma model, we're big fans! @googleai Why? Coz we found it pairs incredibly well with our SelfExtend 🤣🤣🤣 - like, perfectly! With Self-Extend, no fine-tuning needed, we effortlessly expanded Gemma's window from 8k to 90k+! On… Link to tweet"

X

Description:
"Despite the mixed feelings about Google's latest Gemma model, we're big fans! @googleai Why? Coz we found it pairs incredibly well with our SelfExtend - like, perfectly! With Self-Extend, no fine-tuning needed, we effortlessly expanded Gemma's window from 8k to 90k+! On the 'Needle in the haystack' task, Gemma-2b-it even struggled at 8k, but with SelfExtend, Gemma-2b-it easily tackles it within 90k range! #AI #Gemma #SelfExtend #LLMs

Paper: Link to paper"

URL: Link to tweet

Suggested labels

{'label-description': 'Innovative AI model applications', 'label-name': 'AI-Applications', 'gh-repo': 'content-labels', 'confidence': 65.97}

@irthomasthomas irthomasthomas added AI-Chatbots Topics related to advanced chatbot platforms integrating multiple AI models ai-leaderboards leaderdoards for llm's and other ml models Algorithms Sorting, Learning or Classifying. All algorithms go here. llm Large Language Models New-Label Choose this option if the existing labels are insufficient to describe the content accurately Papers Research papers Research personal research notes for a topic labels Feb 28, 2024
@irthomasthomas
Copy link
Owner Author

Related issues

#625: unsloth/README.md at main · unslothai/unsloth

### DetailsSimilarity score: 0.87 - [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

unsloth/README.md at main · unslothai/unsloth

unsloth logo



Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

Unsloth supports Free Notebooks Performance Memory use
Gemma 7b ▶️ Start on Colab 2.4x faster 58% less
Mistral 7b ▶️ Start on Colab 2.2x faster 62% less
Llama-2 7b ▶️ Start on Colab 2.2x faster 43% less
TinyLlama ▶️ Start on Colab 3.9x faster 74% less
CodeLlama 34b A100 ▶️ Start on Colab 1.9x faster 27% less
Mistral 7b 1xT4 ▶️ Start on Kaggle 5x faster* 62% less
DPO - Zephyr ▶️ Start on Colab 1.9x faster 19% less

🦥 Unsloth.ai News

🔗 Links and Resources

Type Links
📚 Wiki & FAQ Read Our Wiki
📜 Documentation Read The Doc
💾 Installation unsloth/README.md
  Twitter (aka X) Follow us on X
🥇 Benchmarking Performance Tables
🌐 Released Models Unsloth Releases
✍️ Blog Read our Blogs

⭐ Key Features

  • All kernels written in OpenAI's Triton language. Manual backprop engine.
  • 0% loss in accuracy - no approximation methods - all exact.
  • No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) Check your GPU! GTX 1070, 1080 works, but is slow.
  • Works on Linux and Windows via WSL.
  • Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
  • Open source trains 5x faster - see Unsloth Pro for 30x faster training!
  • If you trained a model with 🦥Unsloth, you can use this cool sticker!  

🥇 Performance Benchmarking

1 A100 40GB 🤗Hugging Face Flash Attention 🦥Unsloth Open Source 🦥Unsloth Pro
Alpaca 1x 1.04x 1.98x 15.64x
LAION Chip2 1x 0.92x 1.61x 20.73x
OASST 1x 1.19x 2.17x 14.83x
Slim Orca 1x 1.18x 2.22x 14.82x
Free Colab T4 Dataset 🤗Hugging Face Pytorch 2.1.1 🦥Unsloth 🦥 VRAM reduction
Llama-2 7b OASST 1x 1.19x 1.95x -43.3%
Mistral 7b Alpaca 1x 1.07x 1.56x -13.7%
Tiny Llama 1.1b Alpaca 1x 2.06x 3.87x -73.8%
DPO with Zephyr Ultra Chat 1x 1.09x 1.55x -18.6%

View on GitHub

Suggested labels

#647: Qwen-1.5-8x7B : r/LocalLLaMA

### DetailsSimilarity score: 0.85 - [ ] [Qwen-1.5-8x7B : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1atw4ud/qwen158x7b/)

TITLE: Qwen-1.5-8x7B : r/LocalLLaMA

DESCRIPTION: "Qwen-1.5-8x7B

New Model
Someone created a sparse MoE Qwen model by merging and finetuning Qwen1.5-7B

Model: Link to Model

Dataset: Link to Dataset

Thread:

I'm excited to release a project I've been working on the last couple of weeks.

Qwen1.5-8x7b: Link to Model

And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: Link to Dataset

The purpose and intention behind this project is better detailed in the model/dataset card, but basically:

I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card.

I then trained Qwen1.5-7b on a 100k subset over 4 epochs.

Took that and made a MoE using @maximelabonne 's lazymergekit, utilizing a random gate and no base model.

Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io had cuda errors in my machine 3x, expending the rest of my budget for the project after only 0.45/4 epochs.

Good news:

Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests.

Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training.

Thank you to @teknium1 , @jon_durbin , @erhartford , Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @mistralai for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family.

Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity.

We're just getting started."

URL: Link to Reddit Post

Suggested labels

{'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}

#499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.

### DetailsSimilarity score: 0.85 - [ ] [marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq)

CTransformers

PyPI version
Documentation
![Build and Test](https://github.com/ marella / ctransformers / actions / workflows / build.yml / badge.svg)
Code style: black

Python bindings for the Transformer models implemented in C/C++ using GGML library. Also see ChatDocs

Supported Models

Model Model Type CUDA Metal
GPT-2 gpt2
GPT-J, GPT4All-J gptj
GPT-NeoX, StableLM gpt_neox
Falcon falcon
LLaMA, LLaMA 2 llamai
MPT mpt
StarCoder, StarChat gpt_bigcode
Dolly V2 dolly-v2
Replit replit

Installation

To install via pip, simply run:

pip install ctransformers

Usage

It provides a unified interface for all models:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))

Run in Google Colab

To stream the output:

for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)

You can load models from Hugging Face Hub directly:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use with 🤗 Transformers, create the model and tokenizer using:

from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)

Run in Google Colab

You can use 🤗 Transformers text generation pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))

You can use 🤗 Transformers generation parameters:

pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)

You can use 🤗 Transformers tokenizers:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.

LangChain

It is integrated into LangChain. See LangChain docs.

GPU

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)

Run in Google Colab

CUDA

Install CUDA libraries using:

pip install ctransformers[cuda]

ROCm

To enable ROCm support, install the ctransformers package using:

CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers

Metal

To enable Metal support, install the ctransformers package using:

CT_METAL=1 pip install ctransformers --no-binary ctransformers

GPTQ

Note: This is an experimental feature and only LLaMA models are supported using [ExLlama](https
://github.com/TheLastBen/exllama).

Install additional dependencies using:

pip install ctransformers[gptq]

Load a GPTQ model using:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

Run in Google Colab

If the model name or path doesn't contain the word gptq, specify model_type="gptq".

It can also be used with LangChain. Low-level APIs are not fully supported.

Documentation

Find the documentation on Read the Docs.

Config

Parameter Type Description Default
top_k int The top-k value to use for sampling 40
top_p float The top-p value to use for sampling 0.95
temperature float The temperature to use for sampling 0.8
repetition_penalty float The repetition penalty to use for sampling 1.1
last_n_tokens int The number of last tokens to use for repetition penalty 64
seed int The seed value to use for sampling tokens -1
max_new_tokens int The maximum number of new tokens to generate 256
stop List A list of sequences to stop generation when encountered None
stream bool Whether to stream the generated text False
reset bool Whether to reset the model state before generating text True
batch_size int The batch size to use for evaluating tokens in a single prompt 8
threads int The number of threads to use for evaluating tokens -1
context_length int The maximum context length to use -1
gpu_layers int The number of layers to run on GPU 0

Find the URL for the model card for GPTQ here.


Made with ❤️ by marella

Suggested labels

null

#383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### DetailsSimilarity score: 0.84 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)

Deepseek Coder Introduction

Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

Key Features

  • Massive Training Data: Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
  • Highly Flexible & Scalable: Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
  • Superior Model Performance: State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
  • Advanced Code Completion Capabilities: A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

Model Summary

How to Use

This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.

Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Code Insertion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """<|begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|hole|>
    if arr[i] < pivot:
        left.append(arr[i])
    else:
        right.append(arr[i])
return quick_sort(left) + [pivot] + quick_sort(right)<|end|>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])

Repository Level Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

     return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))

License

This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use.

See the LICENSE-MODEL for more details.

Contact

If you have any questions, please raise an issue or contact us at agi_code@deepseek.com.

Suggested labels

{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

#326: Assisted Generation: a new direction toward low-latency text generation

### DetailsSimilarity score: 0.84 > **Assisted Generation: a new direction toward low-latency text generation**

Greedy decoding with assisted generation

Assisted generation is a balancing act. You want the assistant to quickly generate a candidate sequence while being as accurate as possible. If the assistant has poor quality, your get the cost of using the assistant model with little to no benefits. On the other hand, optimizing the quality of the candidate sequences may imply the use of slow assistants, resulting in a net slowdown. While we can't automate the selection of the assistant model for you, we’ve included an additional requirement and a heuristic to ensure the time spent with the assistant stays in check.

First, the requirement – the assistant must have the exact same tokenizer as your model. If this requirement was not in place, expensive token decoding and re-encoding steps would have to be added. Furthermore, these additional steps would have to happen on the CPU, which in turn may need slow inter-device data transfers. Fast usage of the assistant is critical for the benefits of assisted generation to show up.

Finally, the heuristic. By this point, you have probably noticed the similarities between the movie Inception and assisted generation – you are, after all, running text generation inside text generation. There will be one assistant model forward pass per candidate token, and we know that forward passes are expensive. While you can’t know in advance the number of tokens that the assistant model will get right, you can keep track of this information and use it to limit the number of candidate tokens requested to the assistant – some sections of the output are easier to anticipate than others.

Wrapping all up, here’s our original implementation of the assisted generation loop (code):

  1. Use greedy decoding to generate a certain number of candidate tokens with the assistant model, producing candidates. The number of produced candidate tokens is initialized to 5 the first time assisted generation is called.
  2. Using our model, do a forward pass with candidates, obtaining logits.
  3. Use the token selection method (.argmax() for greedy search or .multinomial() for sampling) to get the next_tokens from logits.
  4. Compare next_tokens to candidates and get the number of matching tokens. Remember that this comparison has to be done with left-to-right causality: after the first mismatch, all candidates are invalidated.
  5. Use the number of matches to slice things up and discard variables related to unconfirmed candidate tokens. In essence, in next_tokens, keep the matching tokens plus the first divergent token (which our model generates from a valid candidate subsequence).
  6. Adjust the number of candidate tokens to be produced in the next iteration — our original heuristic increases it by 2 if ALL tokens match and decreases it by 1 otherwise.

We’ve designed the API in 🤗 Transformers such that this process is hassle-free for you. All you need to do is to pass the assistant model under the new assistant_model keyword argument and reap the latency gains! At the time of the release of this blog post, assisted generation is limited to a batch size of 1.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']

Is the additional internal complexity worth it? Let’s have a look at the latency numbers for the greedy decoding case (results for sampling are in the next section), considering a batch size of 1. These results were pulled directly out of 🤗 Transformers without any additional optimizations, so you should be able to reproduce them in your setup.

Assisted Generation Benchmark

OPT: Open OPT: Summ Whisper: ARS CodeGen: Code Flan-T5: Summ
GPU
Omit cases with memory offload? Yes No Image
Assistant Model facebook/opt-125m Model Names: 1.3B: facebook/opt-1.3b 6.7B: facebook/opt-6.7b
30B: facebook/opt-30b 66B: facebook/opt-66b Dataset used as input prompt: C4 (en, validation set) joaogante/assisted_generation_benchmarks
built with Gradio. Hosted on Spaces

Glancing at the collected numbers, we see that assisted generation can deliver significant latency reductions in diverse settings, but it is not a silver bullet – you should benchmark it before applying it to your use case. We can conclude that assisted generation:

  • 🤏 Requires access to an assistant model that is at least an order of magnitude smaller than your model (the bigger the difference, the better);
  • 🚀 Gets up to 3x speedups in the presence of INT8 and up to 2x otherwise, when the model fits in the GPU memory;
  • 🤯 If you’re playing with models that do not fit in your GPU and are relying on memory offloading, you can see up to 10x speedups;
  • 📄 Shines in input-grounded tasks, like automatic speech recognition or summarization.

Sample with assisted generation

Greedy decoding is suited for input-grounded tasks (automatic speech recognition, translation, summarization, ...) or factual knowledge-seeking. Open-ended tasks requiring large levels of creativity, such as most uses of a language model as a chatbot, should use sampling instead. Assisted generation is naturally designed for greedy decoding, but that doesn’t mean that you can’t use assisted generation with multinomial sampling!

Drawing samples from a probability distribution for the next token will cause our greedy assistant to fail more often, reducing its latency benefits. However, we can control how sharp the probability distribution for the next tokens is, using the temperature coefficient that’s present in most sampling-based applications. At one extreme, with temperatures close to 0, sampling will approximate greedy decoding, favoring the most likely token. At the other extreme, with the temperature set to values much larger than 1, sampling will be chaotic, drawing from a uniform distribution. Low temperatures are, therefore, more favorable to your assistant model, retaining most of the latency benefits from assisted generation, as we can see below.

Suggested labels

{ "key": "assisted-generation", "value": "Text generation with the use of an assistant model for latency reduction" }

#363: LongLM: Self-Extend LLM Context Window Without Tuning

### DetailsSimilarity score: 0.83 - [ ] [datamllab/LongLM: LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning](https://github.com/datamllab/LongLM)

LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

Implementation of the proposed SelfExtend in LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning. If you find our method useful, please kindly cite our paper.

@misc{jin2024llm,
title={LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning},
author={Hongye Jin and Xiaotian Han and Jingfeng Yang and Zhimeng Jiang and Zirui Liu and Chia-Yuan Chang and Huiyuan Chen and Xia Hu},
year={2024},
eprint={2401.01325},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Updates:

[01/11/2024]: We've tested the implementation for phi-2. It works. You may find some results on this Reddit post
[01/08/2024]: Add third-party implementations section
[01/07/2024]: Add Implementation for Mistral
[01/05/2024]: Our proposed method is discussed on this Reddit post

  1. Overview

This work elicits LLMs' inherent ability to handle long contexts without fine-tuning. The limited length of the training sequence during training may limit the application of Large Language Models (LLMs) on long input sequences for inference. In this work, we argue that existing LLMs themselves have inherent capabilities for handling long contexts. Based on this argument, we suggest extending LLMs' context window by themselves to fully utilize the inherent ability. We propose Self-Extend to stimulate LLMs' long context handling potential. The basic idea is to construct bi-level attention information: the group level and the neighbor level. The two levels are computed by the original model's self-attention, which means the proposed does not require any training.

Suggested labels

{ "key": "llm-self-extension", "value": "Discussing LLM's inherent capability to handle long contexts without fine-tuning and proposing SelfExtend to stimulate LLM's long context handling potential" }

@irthomasthomas irthomasthomas changed the title HongyeJ on X: "Despite the mixed feelings about Google's latest Gemma model, we're big fans! @GoogleAI Why? Coz we found it pairs incredibly well with our SelfExtend 🤣🤣🤣 - like, perfectly! With Self-Extend, no fine-tuning needed, we effortlessly expanded Gemma's window from 8k to 90k+! On… https://t.co/6hctG16gjd" / X Gemma llm context window extended to 90k using LongLM Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AI-Chatbots Topics related to advanced chatbot platforms integrating multiple AI models ai-leaderboards leaderdoards for llm's and other ml models Algorithms Sorting, Learning or Classifying. All algorithms go here. llm Large Language Models New-Label Choose this option if the existing labels are insufficient to describe the content accurately Papers Research papers Research personal research notes for a topic
Projects
None yet
Development

No branches or pull requests

1 participant