# Qwen-1.5-8x7B : r/LocalLLaMA #647
## Related issues

### #389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

Similarity score: 0.88

- [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569)

**Quantization and Acceleration**

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax:

```bash
python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
```

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model. After converting, you will need a config file to run transforms:

```yaml
transforms: [sentencepiece]

#### Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
```

When considering your priority, please read more here: [GitHub - casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ)

**Important Note:**

**Offline Quantizer Script:**

Enjoy!

**VS: Fast Inference with vLLM**

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:
This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.

#### Suggested labels

{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

### #431: awq llama quantization

Similarity score: 0.88

- [ ] [awq llama quantization](huggingface.co)

**Quantization and Acceleration**

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.

**Model Conversion**

Here's an example of the syntax for converting a model:

```bash
python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
```
For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

**Config File**

After converting, you will need a config file to run transforms:

```yaml
transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
```

**Priority**

When considering your priority:

**Important Note**

**Offline Quantizer Script**

We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

**vLLM Performance**

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:
#### Suggested labels

null

### #456: Baseline benchmark for 17 coding models : r/LocalLLaMA

Similarity score: 0.87

- [ ] [Baseline benchmark for 17 coding models : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19fc4uf/baseline_benchmark_for_17_coding_models/)

**Baseline Benchmark for 17 Coding Models**

Discussion

I am currently working on implementing some ideas for coding model inference strategies (prompting, control, context exploration, CoT, ToT, etc.) and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I could test, so I went for every model that is 7/13B and has an AWQ quant available, since that is what the inference library I use supports. I thought I'd share some numbers.

Notes:

```python
f"""<s>You are a helpful and respectful assistant. Answer the following question: {question}"""
```

**Results**

I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots here.
Suggested labels{ "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }#324: bigcode/tiny_starcoder_py · Hugging Face### DetailsSimilarity score: 0.87 > **Note:** > > [bigcode/tiny_starcoder_py · Hugging Face](https://huggingface.co/bigcode/tiny_starcoder_py) > > TinyStarCoderPy > > This is a 164M parameters model with the same architecture as StarCoder (8k context length, MQA & FIM). It was trained on the Python data from StarCoderData for ~6 epochs which amounts to 100B tokens. > > Use > > Intended use > > The model was trained on GitHub code, to assist with some tasks like Assisted Generation. For pure code completion, we advise using our 15B models StarCoder or StarCoderBase. > > Generation > > ```python > # pip install -q transformers > from transformers import AutoModelForCausalLM, AutoTokenizer > > checkpoint = "bigcode/tiny_starcoder_py" > device = "cuda" # for GPU usage or "cpu" for CPU usage > > tokenizer = AutoTokenizer.from_pretrained(checkpoint) > model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device) > > inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device) > outputs = model.generate(inputs) > print(tokenizer.decode(outputs[0])) > ``` > > Fill-in-the-middle > > Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output: > > ```python > input_text = "def print_one_two_three():\n print('one')\n \n print('three')" > inputs = tokenizer.encode(input_text, return_tensors="pt").to(device) > outputs = model.generate(inputs) > print(tokenizer.decode(outputs[0])) > ``` > > Training > > Model > > - Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective > - Pretraining steps: 50k > - Pretraining tokens: 100 billion > - Precision: bfloat16 > > Hardware > > - GPUs: 32 Tesla A100 > - Training time: 18 hours > > Software > > - Orchestration: Megatron-LM > - Neural networks: PyTorch > - BP16 if applicable: apex > > License > > The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/bigcode/tiny_starcoder_py/blob/main/LICENSE). > > #### Suggested labels > > - { "key": "llm-pretraining", "value": "Information related to the pretraining process of Large Language Models" }#150: Mixture of Experts Explained### DetailsSimilarity score: 0.86 - [ ] [Mixture of Experts Explained](https://huggingface.co/blog/moe)TL;DR MoEs: Are pretrained much faster vs. dense models What is a Mixture of Experts (MoE)? The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps. Mixture of Experts enable models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining. So, what exactly is a MoE? In the context of transformer models, a MoE consists of two main elements: Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs! 
### #304: GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga

Similarity score: 0.86

- [ ] [GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga](https://www.reddit.com/r/Oobabooga/comments/178yqmg/gptq_vs_exl2_vs_awq_vs_q4_k_m_model_sizes/)

**GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes**

Mod Post

I couldn't test AWQ yet because my quantization ended up broken, possibly due to this particular model using NTK scaling, so I'll probably have to go through the fun of burning my GPU for 16 hours again to quantize and evaluate another model so that a conclusion can be reached. Also no idea if Phind-CodeLlama is actually good. WizardCoder-Python might be better.

#### Suggested labels

"LLM-Quantization"
TITLE: Qwen-1.5-8x7B : r/LocalLLaMA
DESCRIPTION: "Qwen-1.5-8x7B
New Model
Someone created a sparse MoE Qwen model by merging and finetuning Qwen1.5-7B
Model: Link to Model
Dataset: Link to Dataset
Thread:
I'm excited to release a project I've been working on for the last couple of weeks.
Qwen1.5-8x7b: Link to Model
And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: Link to Dataset
The purpose and intention behind this project is better detailed in the model/dataset card, but basically:
I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card.
I then trained Qwen1.5-7b on a 100k subset over 4 epochs.
Took that and made a MoE using @maximelabonne's lazymergekit, utilizing a random gate and no base model.
Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io had CUDA errors on my machine 3x, expending the rest of my budget for the project after only 0.45/4 epochs.
Good news:
Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests.
Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training.
Thank you to @teknium1 , @jon_durbin , @erhartford , Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @mistralai for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family.
Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity.
We're just getting started."
URL: Link to Reddit Post
Suggested labels
{'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}
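The merge step described in the post above (lazymergekit with a random gate and no base model) is not reproduced here, but conceptually it amounts to cloning the dense model's FFN weights into each expert and randomly initializing the router, so the experts only diverge during the subsequent finetuning. A rough conceptual sketch under those assumptions (this is not lazymergekit/mergekit code, and the module names and sizes are made up):

```python
# Conceptual sketch only: build an 8-expert MoE block from one dense FFN with a
# freshly (randomly) initialized gate, mirroring the "random gate" merge idea.
# Not lazymergekit/mergekit code; names and sizes are illustrative assumptions.
import copy
import torch.nn as nn

d_model, d_ff, n_experts = 512, 2048, 8

# Stand-in for one FFN block of the dense Qwen1.5-7B-style model.
dense_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

class RandomGateMoE(nn.Module):
    def __init__(self, ffn, n_experts):
        super().__init__()
        # Every expert starts as an exact copy of the dense FFN ...
        self.experts = nn.ModuleList([copy.deepcopy(ffn) for _ in range(n_experts)])
        # ... while the gate is new and random, so experts specialize only once
        # the merged model is finetuned further (the 351,000-pair run in the post).
        self.gate = nn.Linear(d_model, n_experts)

moe_block = RandomGateMoE(dense_ffn, n_experts)
print(sum(p.numel() for p in moe_block.parameters()))  # parameter count of the MoE block
```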