Gemma llm context window extended to 90k using LongLM #661

irthomasthomas · 2024-02-28T18:50:46Z

HongyeJ on X: "Despite the mixed feelings about Google's latest Gemma model, we're big fans! @GoogleAI Why? Coz we found it pairs incredibly well with our SelfExtend 🤣🤣🤣 - like, perfectly! With Self-Extend, no fine-tuning needed, we effortlessly expanded Gemma's window from 8k to 90k+! On… https://t.co/6hctG16gjd" / X

Description:
"Despite the mixed feelings about Google's latest Gemma model, we're big fans! @googleai Why? Coz we found it pairs incredibly well with our SelfExtend - like, perfectly! With Self-Extend, no fine-tuning needed, we effortlessly expanded Gemma's window from 8k to 90k+! On the 'Needle in the haystack' task, Gemma-2b-it even struggled at 8k, but with SelfExtend, Gemma-2b-it easily tackles it within 90k range! #AI #Gemma #SelfExtend #LLMs

Paper: Link to paper"
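
For readers unfamiliar with the 'Needle in the haystack' task mentioned above, here is a minimal sketch of how such a test prompt is commonly built: one fact (the needle) is buried at a chosen depth inside long filler text, and the model is asked to retrieve it. The helper names and the filler text are hypothetical and are not taken from the tweet or the paper.

```python
def build_haystack_prompt(needle, filler_sentence, total_sentences, depth):
    """Bury `needle` inside repeated filler text.

    depth is a fraction in [0, 1]: 0 puts the needle first, 1 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler_sentence] * total_sentences
    insert_at = round(depth * total_sentences)
    sentences.insert(insert_at, needle)
    return " ".join(sentences)


def found_needle(answer, expected):
    """Score a model answer: does it contain the expected fact?"""
    return expected.lower() in answer.lower()


prompt = build_haystack_prompt(
    needle="The secret number is 7421.",
    filler_sentence="The sky is blue.",
    total_sentences=1000,
    depth=0.5,
)
```

Sweeping `total_sentences` and `depth` gives the usual grid of context length against needle position.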

URL: Link to tweet

Suggested labels

{'label-description': 'Innovative AI model applications', 'label-name': 'AI-Applications', 'gh-repo': 'content-labels', 'confidence': 65.97}

irthomasthomas · 2024-02-28T18:50:48Z

Related issues

#625: unsloth/README.md at main · unslothai/unsloth

### Details

Similarity score: 0.87 - [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

unsloth/README.md at main · unslothai/unsloth

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
| --- | --- | --- | --- |
| Gemma 7b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| Llama-2 7b | ▶️ Start on Colab | 2.2x faster | 43% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| CodeLlama 34b A100 | ▶️ Start on Colab | 1.9x faster | 27% less |
| Mistral 7b 1xT4 | ▶️ Start on Kaggle | 5x faster* | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

🦥 Unsloth.ai News

📣 Gemma 7b on 6T tokens now works. And Gemma 2b notebook
📣 Added conversational notebooks and raw text notebooks
📣 2x faster inference added for all our models
📣 DPO support is now included. More info on DPO
📣 We did a blog with 🤗Hugging Face and are in their official docs! Check out the SFT docs and DPO docs
📣 Download models 4x faster from 🤗Hugging Face. Eg: unsloth/mistral-7b-bnb-4bit

🔗 Links and Resources

| Type | Links |
| --- | --- |
| 📚 Wiki & FAQ | Read Our Wiki |
| 📜 Documentation | Read The Doc |
| 💾 Installation | unsloth/README.md |
| Twitter (aka X) | Follow us on X |
| 🥇 Benchmarking | Performance Tables |
| 🌐 Released Models | Unsloth Releases |
| ✍️ Blog | Read our Blogs |

⭐ Key Features

All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) Check your GPU! GTX 1070, 1080 works, but is slow.
Works on Linux and Windows via WSL.
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
Open source trains 5x faster - see Unsloth Pro for 30x faster training!
If you trained a model with 🦥Unsloth, you can use this cool sticker!

🥇 Performance Benchmarking

For the full list of reproducible benchmarking tables, go to our website

| 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥Unsloth Pro |
| --- | --- | --- | --- | --- |
| Alpaca | 1x | 1.04x | 1.98x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 14.82x |

Benchmarking table below was conducted by 🤗Hugging Face.

| Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction |
| --- | --- | --- | --- | --- | --- |
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |


Suggested labels

#647: Qwen-1.5-8x7B : r/LocalLLaMA

### Details

Similarity score: 0.85 - [ ] [Qwen-1.5-8x7B : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1atw4ud/qwen158x7b/)

TITLE: Qwen-1.5-8x7B : r/LocalLLaMA

DESCRIPTION: "Qwen-1.5-8x7B

New Model
Someone created a sparse MoE Qwen model by merging and finetuning Qwen1.5-7B

Model: Link to Model

Dataset: Link to Dataset

Thread:

I'm excited to release a project I've been working on the last couple of weeks.

Qwen1.5-8x7b: Link to Model

And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: Link to Dataset

The purpose and intention behind this project is better detailed in the model/dataset card, but basically:

I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card.

I then trained Qwen1.5-7b on a 100k subset over 4 epochs.

Took that and made a MoE using @maximelabonne 's lazymergekit, utilizing a random gate and no base model.

Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io had cuda errors in my machine 3x, expending the rest of my budget for the project after only 0.45/4 epochs.

Good news:

Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests.

Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training.

Thank you to @teknium1 , @jon_durbin , @erhartford , Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @mistralai for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family.

Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity.

We're just getting started."

URL: Link to Reddit Post

Suggested labels

{'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}

#499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.

### Details

Similarity score: 0.85 - [ ] [marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq)

CTransformers

![Build and Test](https://github.com/marella/ctransformers/actions/workflows/build.yml/badge.svg)

Python bindings for the Transformer models implemented in C/C++ using GGML library. Also see ChatDocs

Supported Models

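| Model | Model Type | CUDA | Metal |
| --- | --- | --- | --- |
| GPT-2 | gpt2 | | |
| GPT-J, GPT4All-J | gptj | | |
| GPT-NeoX, StableLM | gpt_neox | | |
| Falcon | falcon | ✅ | |
| LLaMA, LLaMA 2 | llama | ✅ | ✅ |
| MPT | mpt | ✅ | |
| StarCoder, StarChat | gpt_bigcode | ✅ | |
| Dolly V2 | dolly-v2 | | |
| Replit | replit | | |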

Installation

To install via pip, simply run:

pip install ctransformers

Usage

It provides a unified interface for all models:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))

Run in Google Colab

To stream the output:

for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)

You can load models from Hugging Face Hub directly:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use with 🤗 Transformers, create the model and tokenizer using:

from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)

Run in Google Colab

You can use 🤗 Transformers text generation pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))

You can use 🤗 Transformers generation parameters:

pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)

You can use 🤗 Transformers tokenizers:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.

LangChain

It is integrated into LangChain. See LangChain docs.

GPU

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)

Run in Google Colab

CUDA

Install CUDA libraries using:

pip install ctransformers[cuda]

ROCm

To enable ROCm support, install the ctransformers package using:

CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers

Metal

To enable Metal support, install the ctransformers package using:

CT_METAL=1 pip install ctransformers --no-binary ctransformers

GPTQ

Note: This is an experimental feature and only LLaMA models are supported using [ExLlama](https://github.com/TheLastBen/exllama).

Install additional dependencies using:

pip install ctransformers[gptq]

Load a GPTQ model using:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

Run in Google Colab

If the model name or path doesn't contain the word gptq, specify model_type="gptq".

It can also be used with LangChain. Low-level APIs are not fully supported.

Documentation

Find the documentation on Read the Docs.

Config

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `top_k` | `int` | The top-k value to use for sampling | `40` |
| `top_p` | `float` | The top-p value to use for sampling | `0.95` |
| `temperature` | `float` | The temperature to use for sampling | `0.8` |
| `repetition_penalty` | `float` | The repetition penalty to use for sampling | `1.1` |
| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty | `64` |
| `seed` | `int` | The seed value to use for sampling tokens | `-1` |
| `max_new_tokens` | `int` | The maximum number of new tokens to generate | `256` |
| `stop` | `List` | A list of sequences to stop generation when encountered | `None` |
| `stream` | `bool` | Whether to stream the generated text | `False` |
| `reset` | `bool` | Whether to reset the model state before generating text | `True` |
| `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt | `8` |
| `threads` | `int` | The number of threads to use for evaluating tokens | `-1` |
| `context_length` | `int` | The maximum context length to use | `-1` |
| `gpu_layers` | `int` | The number of layers to run on GPU | `0` |
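
To make the `top_k` and `top_p` rows above concrete, here is a minimal pure-Python sketch of the usual top-k then top-p (nucleus) filtering over a next-token distribution. It illustrates the general technique only; it is not ctransformers' internal implementation, and `top_k_top_p_filter` is a hypothetical helper name.

```python
def top_k_top_p_filter(probs, top_k=40, top_p=0.95):
    """Return {token_index: renormalized_prob} after top-k then top-p filtering.

    probs is a list of next-token probabilities that sums to 1.
    """
    # Rank tokens from most to least likely and keep at most top_k of them.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    ranked = ranked[:top_k]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1 again.
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}


filtered = top_k_top_p_filter([0.5, 0.3, 0.15, 0.05], top_k=3, top_p=0.75)
```

With these inputs only the two most likely tokens survive, and sampling then draws from the renormalized pair.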

Find the URL for the model card for GPTQ here.

Made with ❤️ by marella

Suggested labels

null

#383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### Details

Similarity score: 0.84 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)

Deepseek Coder Introduction

Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

Key Features

Massive Training Data: Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
Highly Flexible & Scalable: Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
Superior Model Performance: State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
Advanced Code Completion Capabilities: A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

Model Summary

deepseek-coder-5.7bmqa-base: A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens.
Home Page: DeepSeek
Repository: deepseek-ai/deepseek-coder
Chat With DeepSeek Coder: DeepSeek-Coder

How to Use

This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.

Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Code Insertion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """<｜fim▁begin｜>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<｜fim▁hole｜>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<｜fim▁end｜>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])

Repository Level Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

    return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))

License

This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use.

See the LICENSE-MODEL for more details.

Contact

If you have any questions, please raise an issue or contact us at agi_code@deepseek.com.

Suggested labels

{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

#326: Assisted Generation: a new direction toward low-latency text generation

### Details

Similarity score: 0.84 > **Assisted Generation: a new direction toward low-latency text generation**

Greedy decoding with assisted generation

Assisted generation is a balancing act. You want the assistant to quickly generate a candidate sequence while being as accurate as possible. If the assistant has poor quality, you get the cost of using the assistant model with little to no benefit. On the other hand, optimizing the quality of the candidate sequences may imply the use of slow assistants, resulting in a net slowdown. While we can't automate the selection of the assistant model for you, we’ve included an additional requirement and a heuristic to ensure the time spent with the assistant stays in check.

First, the requirement – the assistant must have the exact same tokenizer as your model. If this requirement was not in place, expensive token decoding and re-encoding steps would have to be added. Furthermore, these additional steps would have to happen on the CPU, which in turn may need slow inter-device data transfers. Fast usage of the assistant is critical for the benefits of assisted generation to show up.

Finally, the heuristic. By this point, you have probably noticed the similarities between the movie Inception and assisted generation – you are, after all, running text generation inside text generation. There will be one assistant model forward pass per candidate token, and we know that forward passes are expensive. While you can’t know in advance the number of tokens that the assistant model will get right, you can keep track of this information and use it to limit the number of candidate tokens requested to the assistant – some sections of the output are easier to anticipate than others.

Wrapping all up, here’s our original implementation of the assisted generation loop (code):

1. Use greedy decoding to generate a certain number of candidate tokens with the assistant model, producing candidates. The number of produced candidate tokens is initialized to 5 the first time assisted generation is called.
2. Using our model, do a forward pass with candidates, obtaining logits.
3. Use the token selection method (.argmax() for greedy search or .multinomial() for sampling) to get the next_tokens from logits.
4. Compare next_tokens to candidates and get the number of matching tokens. Remember that this comparison has to be done with left-to-right causality: after the first mismatch, all candidates are invalidated.
5. Use the number of matches to slice things up and discard variables related to unconfirmed candidate tokens. In essence, in next_tokens, keep the matching tokens plus the first divergent token (which our model generates from a valid candidate subsequence).
6. Adjust the number of candidate tokens to be produced in the next iteration — our original heuristic increases it by 2 if ALL tokens match and decreases it by 1 otherwise.
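
The steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the 🤗 Transformers implementation: `model_next` and `assistant_next` are hypothetical callables that map a token list to the greedy next token, and the main model is called once per position here, whereas a real implementation scores all candidates in a single forward pass.

```python
def assisted_greedy_decode(model_next, assistant_next, prompt,
                           max_new_tokens, num_candidates=5):
    """Greedy decoding where a cheap assistant drafts tokens for the main model."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Step 1: the assistant greedily drafts candidate tokens.
        draft = list(tokens)
        candidates = []
        for _ in range(num_candidates):
            nxt = assistant_next(draft)
            candidates.append(nxt)
            draft.append(nxt)

        # Steps 2-3: the main model's greedy choice after each candidate prefix.
        next_tokens = [model_next(tokens + candidates[:i])
                       for i in range(len(candidates) + 1)]

        # Step 4: count matches left to right, stopping at the first mismatch.
        n_matches = 0
        for candidate, chosen in zip(candidates, next_tokens):
            if candidate != chosen:
                break
            n_matches += 1

        # Step 5: keep the matching tokens plus the first divergent token.
        tokens.extend(next_tokens[:n_matches + 1])

        # Step 6: heuristic from the text: +2 if all matched, -1 otherwise.
        if n_matches == len(candidates):
            num_candidates += 2
        else:
            num_candidates = max(1, num_candidates - 1)
    return tokens[:target_len]
```

Because every kept token is the main model's own greedy choice, the output matches plain greedy decoding with the main model; the assistant only changes how many tokens are confirmed per iteration.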

We’ve designed the API in 🤗 Transformers such that this process is hassle-free for you. All you need to do is to pass the assistant model under the new assistant_model keyword argument and reap the latency gains! At the time of the release of this blog post, assisted generation is limited to a batch size of 1.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']

Is the additional internal complexity worth it? Let’s have a look at the latency numbers for the greedy decoding case (results for sampling are in the next section), considering a batch size of 1. These results were pulled directly out of 🤗 Transformers without any additional optimizations, so you should be able to reproduce them in your setup.

Assisted Generation Benchmark (interactive Gradio demo hosted on Spaces, not reproduced here). Tabs: OPT (open-ended generation and summarization), Whisper (ASR), CodeGen (code), Flan-T5 (summarization). For OPT, the assistant model is facebook/opt-125m, the models are facebook/opt-1.3b, facebook/opt-6.7b, facebook/opt-30b, and facebook/opt-66b, and the input prompts come from C4 (en, validation set). Benchmark data: joaogante/assisted_generation_benchmarks.

Glancing at the collected numbers, we see that assisted generation can deliver significant latency reductions in diverse settings, but it is not a silver bullet – you should benchmark it before applying it to your use case. We can conclude that assisted generation:

🤏 Requires access to an assistant model that is at least an order of magnitude smaller than your model (the bigger the difference, the better);
🚀 Gets up to 3x speedups in the presence of INT8 and up to 2x otherwise, when the model fits in the GPU memory;
🤯 If you’re playing with models that do not fit in your GPU and are relying on memory offloading, you can see up to 10x speedups;
📄 Shines in input-grounded tasks, like automatic speech recognition or summarization.

Sample with assisted generation

Greedy decoding is suited for input-grounded tasks (automatic speech recognition, translation, summarization, ...) or factual knowledge-seeking. Open-ended tasks requiring large levels of creativity, such as most uses of a language model as a chatbot, should use sampling instead. Assisted generation is naturally designed for greedy decoding, but that doesn’t mean that you can’t use assisted generation with multinomial sampling!

Drawing samples from a probability distribution for the next token will cause our greedy assistant to fail more often, reducing its latency benefits. However, we can control how sharp the probability distribution for the next tokens is, using the temperature coefficient that’s present in most sampling-based applications. At one extreme, with temperatures close to 0, sampling will approximate greedy decoding, favoring the most likely token. At the other extreme, with the temperature set to values much larger than 1, sampling will be chaotic, drawing from a uniform distribution. Low temperatures are, therefore, more favorable to your assistant model, retaining most of the latency benefits from assisted generation, as we can see below.
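
A minimal sketch of the temperature effect described above, using a plain softmax over made-up logits (the logit values and the helper name are illustrative assumptions):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.05)   # close to greedy decoding
flat = softmax_with_temperature(logits, 100.0)   # close to uniform
```

At low temperature almost all probability mass sits on the top token, so a greedy assistant's guesses are accepted more often; at high temperature the distribution is nearly uniform and most candidates are rejected.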

Suggested labels

{ "key": "assisted-generation", "value": "Text generation with the use of an assistant model for latency reduction" }

#363: LongLM: Self-Extend LLM Context Window Without Tuning

### Details

Similarity score: 0.83 - [ ] [datamllab/LongLM: LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning](https://github.com/datamllab/LongLM)

LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

Implementation of the proposed SelfExtend in LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning. If you find our method useful, please kindly cite our paper.

@misc{jin2024llm,
title={LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning},
author={Hongye Jin and Xiaotian Han and Jingfeng Yang and Zhimeng Jiang and Zirui Liu and Chia-Yuan Chang and Huiyuan Chen and Xia Hu},
year={2024},
eprint={2401.01325},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Updates:

[01/11/2024]: We've tested the implementation for phi-2. It works. You may find some results on this Reddit post
[01/08/2024]: Add third-party implementations section
[01/07/2024]: Add Implementation for Mistral
[01/05/2024]: Our proposed method is discussed on this Reddit post

Overview

This work elicits LLMs' inherent ability to handle long contexts without fine-tuning. The limited length of the training sequence during training may limit the application of Large Language Models (LLMs) on long input sequences for inference. In this work, we argue that existing LLMs themselves have inherent capabilities for handling long contexts. Based on this argument, we suggest extending LLMs' context window by themselves to fully utilize the inherent ability. We propose Self-Extend to stimulate LLMs' long context handling potential. The basic idea is to construct bi-level attention information: the group level and the neighbor level. The two levels are computed by the original model's self-attention, which means the proposed method does not require any training.
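
A rough sketch of the bi-level idea, based on my reading of the paper rather than on the datamllab/LongLM code: nearby tokens keep their exact relative positions (neighbor level), while distant tokens share floor-divided positions (group level), shifted so the two ranges join up. The function and parameter names (`group_size`, `neighbor_window`) are my own labels, and the shift term and length formula are assumptions taken from that reading.

```python
def self_extend_relative_position(q_pos, k_pos, group_size, neighbor_window):
    """Relative position a query at q_pos sees for a key at k_pos."""
    distance = q_pos - k_pos
    if distance < 0:
        raise ValueError("causal attention: the key must not come after the query")
    if distance < neighbor_window:
        # Neighbor level: exact relative position, as in the original model.
        return distance
    # Group level: floor-divided positions, shifted to continue after the window.
    shift = neighbor_window - neighbor_window // group_size
    return q_pos // group_size - k_pos // group_size + shift


def max_extended_length(pretrain_len, group_size, neighbor_window):
    """Longest input whose mapped relative positions stay within pretrain_len."""
    return (pretrain_len - neighbor_window) * group_size + neighbor_window
```

Under these assumptions, a larger `group_size` stretches the usable window roughly proportionally, at the cost of coarser position information for distant tokens.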

Suggested labels

{ "key": "llm-self-extension", "value": "Discussing LLM's inherent capability to handle long contexts without fine-tuning and proposing SelfExtend to stimulate LLM's long context handling potential" }
