-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gemma llm context window extended to 90k using LongLM #661
Comments
Related issues#625: unsloth/README.md at main · unslothai/unsloth### DetailsSimilarity score: 0.87 - [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)unsloth/README.md at main · unslothai/unsloth✨ Finetune for FreeAll notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
🦥 Unsloth.ai News
🔗 Links and Resources
⭐ Key Features
🥇 Performance Benchmarking
Suggested labels#647: Qwen-1.5-8x7B : r/LocalLLaMA### DetailsSimilarity score: 0.85 - [ ] [Qwen-1.5-8x7B : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1atw4ud/qwen158x7b/)TITLE: Qwen-1.5-8x7B : r/LocalLLaMADESCRIPTION: "Qwen-1.5-8x7B New Model Model: Link to Model Dataset: Link to Dataset Thread: I'm excited to release a project I've been working on the last couple of weeks. Qwen1.5-8x7b: Link to Model And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: Link to Dataset The purpose and intention behind this project is better detailed in the model/dataset card, but basically: I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card. I then trained Qwen1.5-7b on a 100k subset over 4 epochs. Took that and made a MoE using @maximelabonne 's lazymergekit, utilizing a random gate and no base model. Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io had cuda errors in my machine 3x, expending the rest of my budget for the project after only 0.45/4 epochs. Good news: Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests. Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training. Thank you to @teknium1 , @jon_durbin , @erhartford , Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @mistralai for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family. Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity. We're just getting started." URL: Link to Reddit Post Suggested labels{'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}#499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.### DetailsSimilarity score: 0.85 - [ ] [marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq)CTransformers
Python bindings for the Transformer models implemented in C/C++ using GGML library. Also see ChatDocs Supported Models
InstallationTo install via
UsageIt provides a unified interface for all models: from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")
print(llm("AI is going to")) Run in Google Colab To stream the output: for text in llm("AI is going to", stream=True):
print(text, end="", flush=True) You can load models from Hugging Face Hub directly: llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml") If a model repo has multiple model files ( llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin") 🤗 TransformersNote: This is an experimental feature and may change in the future. To use with 🤗 Transformers, create the model and tokenizer using: from ctransformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model) Run in Google Colab You can use 🤗 Transformers text generation pipeline: from transformers import pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256)) You can use 🤗 Transformers generation parameters: pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1) You can use 🤗 Transformers tokenizers: from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True) # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2") # Load tokenizer from original model repo. LangChainIt is integrated into LangChain. See LangChain docs. GPUTo run some of the model layers on GPU, set the llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50) Run in Google Colab CUDAInstall CUDA libraries using: pip install ctransformers[cuda] ROCmTo enable ROCm support, install the CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers MetalTo enable Metal support, install the CT_METAL=1 pip install ctransformers --no-binary ctransformers GPTQNote: This is an experimental feature and only LLaMA models are supported using [ExLlama](https Install additional dependencies using: pip install ctransformers[gptq] Load a GPTQ model using: llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ") Run in Google Colab If the model name or path doesn't contain the word It can also be used with LangChain. Low-level APIs are not fully supported. DocumentationFind the documentation on Read the Docs. Config
Find the URL for the model card for GPTQ here. Made with ❤️ by marella Suggested labelsnull#383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face### DetailsSimilarity score: 0.84 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)Deepseek Coder IntroductionDeepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks. Key Features
Model Summary
How to UseThis section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks. Code Completionfrom transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()
input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) Code Insertionfrom transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()
input_text = """<|begin|>def quick_sort(arr):
if len(arr) <= 1:
return arr
pivot = arr[0]
left = []
right = []
<|hole|>
if arr[i] < pivot:
left.append(arr[i])
else:
right.append(arr[i])
return quick_sort(left) + [pivot] + quick_sort(right)<|end|>"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):]) Repository Level Code Completionfrom transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()
input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
def load_data():
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Standardize the data
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Convert numpy data to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.int64)
y_test = torch.tensor(y_test, dtype=torch.int64)
return X_train, X_test, y_train, y_test
def evaluate_predictions(y_test, y_pred):
return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
class IrisClassifier(nn.Module):
def __init__(self):
super(IrisClassifier, self).__init__()
self.fc = nn.Sequential(
nn.Linear(4, 16),
nn.ReLU(),
nn.Linear(16, 3)
)
def forward(self, x):
return self.fc(x)
def train_model(self, X_train, y_train, epochs, lr, batch_size):
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(self.parameters(), lr=lr)
# Create DataLoader for batches
dataset = TensorDataset(X_train, y_train)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
for epoch in range(epochs):
for batch_X, batch_y in dataloader:
optimizer.zero_grad()
outputs = self(batch_X)
loss = criterion(outputs, batch_y)
loss.backward()
optimizer.step()
def predict(self, X_test):
with torch.no_grad():
outputs = self(X_test)
_, predicted = outputs.max(1)
return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier
def main():
# Model training and evaluation
"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0])) LicenseThis code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use. See the LICENSE-MODEL for more details. ContactIf you have any questions, please raise an issue or contact us at agi_code@deepseek.com. Suggested labels{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }#326: Assisted Generation: a new direction toward low-latency text generation### DetailsSimilarity score: 0.84 > **Assisted Generation: a new direction toward low-latency text generation**Greedy decoding with assisted generation Assisted generation is a balancing act. You want the assistant to quickly generate a candidate sequence while being as accurate as possible. If the assistant has poor quality, your get the cost of using the assistant model with little to no benefits. On the other hand, optimizing the quality of the candidate sequences may imply the use of slow assistants, resulting in a net slowdown. While we can't automate the selection of the assistant model for you, we’ve included an additional requirement and a heuristic to ensure the time spent with the assistant stays in check. First, the requirement – the assistant must have the exact same tokenizer as your model. If this requirement was not in place, expensive token decoding and re-encoding steps would have to be added. Furthermore, these additional steps would have to happen on the CPU, which in turn may need slow inter-device data transfers. Fast usage of the assistant is critical for the benefits of assisted generation to show up. Finally, the heuristic. By this point, you have probably noticed the similarities between the movie Inception and assisted generation – you are, after all, running text generation inside text generation. There will be one assistant model forward pass per candidate token, and we know that forward passes are expensive. While you can’t know in advance the number of tokens that the assistant model will get right, you can keep track of this information and use it to limit the number of candidate tokens requested to the assistant – some sections of the output are easier to anticipate than others. Wrapping all up, here’s our original implementation of the assisted generation loop (code):
We’ve designed the API in 🤗 Transformers such that this process is hassle-free for you. All you need to do is to pass the assistant model under the new from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a'] Is the additional internal complexity worth it? Let’s have a look at the latency numbers for the greedy decoding case (results for sampling are in the next section), considering a batch size of 1. These results were pulled directly out of 🤗 Transformers without any additional optimizations, so you should be able to reproduce them in your setup. Assisted Generation Benchmark
Glancing at the collected numbers, we see that assisted generation can deliver significant latency reductions in diverse settings, but it is not a silver bullet – you should benchmark it before applying it to your use case. We can conclude that assisted generation:
Sample with assisted generation Greedy decoding is suited for input-grounded tasks (automatic speech recognition, translation, summarization, ...) or factual knowledge-seeking. Open-ended tasks requiring large levels of creativity, such as most uses of a language model as a chatbot, should use sampling instead. Assisted generation is naturally designed for greedy decoding, but that doesn’t mean that you can’t use assisted generation with multinomial sampling! Drawing samples from a probability distribution for the next token will cause our greedy assistant to fail more often, reducing its latency benefits. However, we can control how sharp the probability distribution for the next tokens is, using the temperature coefficient that’s present in most sampling-based applications. At one extreme, with temperatures close to 0, sampling will approximate greedy decoding, favoring the most likely token. At the other extreme, with the temperature set to values much larger than 1, sampling will be chaotic, drawing from a uniform distribution. Low temperatures are, therefore, more favorable to your assistant model, retaining most of the latency benefits from assisted generation, as we can see below. Suggested labels{ "key": "assisted-generation", "value": "Text generation with the use of an assistant model for latency reduction" }#363: LongLM: Self-Extend LLM Context Window Without Tuning### DetailsSimilarity score: 0.83 - [ ] [datamllab/LongLM: LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning](https://github.com/datamllab/LongLM)LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning Implementation of the proposed SelfExtend in LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning. If you find our method useful, please kindly cite our paper. @misc{jin2024llm, [01/11/2024]: We've tested the implementation for phi-2. It works. You may find some results on this Reddit post
This work elicits LLMs' inherent ability to handle long contexts without fine-tuning. The limited length of the training sequence during training may limit the application of Large Language Models (LLMs) on long input sequences for inference. In this work, we argue that existing LLMs themselves have inherent capabilities for handling long contexts. Based on this argument, we suggest extending LLMs' context window by themselves to fully utilize the inherent ability. We propose Self-Extend to stimulate LLMs' long context handling potential. The basic idea is to construct bi-level attention information: the group level and the neighbor level. The two levels are computed by the original model's self-attention, which means the proposed does not require any training. Suggested labels{ "key": "llm-self-extension", "value": "Discussing LLM's inherent capability to handle long contexts without fine-tuning and proposing SelfExtend to stimulate LLM's long context handling potential" } |
HongyeJ on X:
"Despite the mixed feelings about Google's latest Gemma model, we're big fans! @googleai Why? Coz we found it pairs incredibly well with our SelfExtend 🤣🤣🤣 - like, perfectly! With Self-Extend, no fine-tuning needed, we effortlessly expanded Gemma's window from 8k to 90k+! On… Link to tweet"
X
Description:
"Despite the mixed feelings about Google's latest Gemma model, we're big fans! @googleai Why? Coz we found it pairs incredibly well with our SelfExtend - like, perfectly! With Self-Extend, no fine-tuning needed, we effortlessly expanded Gemma's window from 8k to 90k+! On the 'Needle in the haystack' task, Gemma-2b-it even struggled at 8k, but with SelfExtend, Gemma-2b-it easily tackles it within 90k range! #AI #Gemma #SelfExtend #LLMs
Paper: Link to paper"
URL: Link to tweet
Suggested labels
{'label-description': 'Innovative AI model applications', 'label-name': 'AI-Applications', 'gh-repo': 'content-labels', 'confidence': 65.97}
The text was updated successfully, but these errors were encountered: