Qwen-1.5-8x7B : r/LocalLLaMA #647

Open
1 task
irthomasthomas opened this issue Feb 28, 2024 · 1 comment
Labels
- base-model: llm base models not finetuned for chat
- dataset: public datasets and embeddings
- llm: Large Language Models
- llm-experiments: experiments with large language models
- llm-inference-engines: Software to run inference on large language models
- New-Label: Choose this option if the existing labels are insufficient to describe the content accurately
- Research: personal research notes for a topic

Comments

@irthomasthomas (Owner)

TITLE: Qwen-1.5-8x7B : r/LocalLLaMA

DESCRIPTION: "Qwen-1.5-8x7B

New Model
Someone created a sparse MoE Qwen model by merging and finetuning Qwen1.5-7B

Model: Link to Model

Dataset: Link to Dataset

Thread:

I'm excited to release a project I've been working on the last couple of weeks.

Qwen1.5-8x7b: Link to Model

And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: Link to Dataset

The purpose and intention behind this project is better detailed in the model/dataset card, but basically:

I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card.

I then trained Qwen1.5-7b on a 100k subset over 4 epochs.

Took that and made a MoE using @maximelabonne's lazymergekit, utilizing a random gate and no base model.

Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io hit CUDA errors on my machine three times, expending the rest of my budget for the project after only 0.45/4 epochs.

Good news:

Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests.

Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training.

Thank you to @teknium1 , @jon_durbin , @erhartford , Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @mistralai for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family.

Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity.

We're just getting started."

URL: Link to Reddit Post

Suggested labels

{'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}
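The thread describes merging eight copies of Qwen1.5-7B into a sparse MoE with a randomly initialized gate. As a rough illustration of that step, a mergekit-moe ("lazymergekit"-style) recipe might look like the sketch below; the expert count, repository name, output path, and gate settings are assumptions inferred from the post, not the author's actual config, and the exact mergekit schema may differ between versions.

```python
# Hedged sketch: build and run a mergekit-moe config for "8 x Qwen1.5-7B" with a
# random gate. All names and options here are assumptions based on the thread.
import subprocess

experts = "\n".join(
    "  - source_model: Qwen/Qwen1.5-7B" for _ in range(8)  # 8 identical experts -> "8x7B"
)

config = f"""\
base_model: Qwen/Qwen1.5-7B
gate_mode: random   # randomly initialized routers, as described in the post
dtype: bfloat16
experts:
{experts}
"""

with open("qwen1.5-8x7b.yaml", "w") as f:
    f.write(config)

# Requires `pip install mergekit`; writes the merged MoE to ./Qwen1.5-8x7b
subprocess.run(["mergekit-moe", "qwen1.5-8x7b.yaml", "./Qwen1.5-8x7b"], check=True)
```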

@irthomasthomas added the base-model, dataset, llm, llm-experiments, llm-inference-engines, New-Label, and Research labels on Feb 28, 2024
@irthomasthomas (Owner, Author)

Related issues

#389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

Similarity score: 0.88

- [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax:

python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ-quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
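As a hedged illustration of how that config might be used (flag names follow typical OpenNMT-py usage and the file names are placeholders; verify against your installed version), inference could be launched along these lines:

```python
# Hypothetical invocation of OpenNMT-py inference with the config above saved as
# inference.yaml. Input/output file names are placeholders for illustration.
import subprocess

subprocess.run(
    [
        "python", "translate.py",
        "-config", "inference.yaml",  # the YAML config shown above
        "-src", "prompts.txt",        # one prompt per line
        "-output", "outputs.txt",     # where generations are written
    ],
    check=True,
)
```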

When considering your priority:

  • If you need small model files that fit in your GPU's VRAM, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 at batch size 1.

Please read more here: GitHub - casper-hansen/AutoAWQ

Important Note:

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.
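To make the GEMM/GEMV choice concrete, here is a minimal quantization sketch following the AutoAWQ README; the model and output paths are placeholders, and the API may differ between AutoAWQ releases:

```python
# Minimal AutoAWQ quantization sketch (paths and settings are placeholders).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"   # source FP16 model (example)
quant_path = "mistral-7b-instruct-v0.2-awq"         # output directory for the quant

# "version" selects the kernel flavor discussed above: "GEMM" or "GEMV".
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # calibrate and quantize weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```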

Offline Quantizer Script:

  • We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

Enjoy!


Comparison: Fast Inference with vLLM

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.

This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.
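As a toy illustration of that adjustment (every number below is invented, not a measurement from the post), the generation-only rate can be backed out by subtracting the prefill time from the total wall-clock time:

```python
# Toy prefill adjustment; all numbers are invented for illustration.
new_tokens = 256        # tokens generated after the prompt
total_time_s = 3.2      # wall-clock time including step0 (prompt prefill)
prefill_time_s = 0.4    # time spent on step0 alone

raw_tps = new_tokens / total_time_s                        # naive throughput
decode_tps = new_tokens / (total_time_s - prefill_time_s)  # prefill-adjusted throughput
print(f"raw: {raw_tps:.1f} tok/s, prefill-adjusted: {decode_tps:.1f} tok/s")
```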

Suggested labels

{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

#431: awq llama quantization

Similarity score: 0.88

- [ ] [awq llama quantization](huggingface.co)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.

Model Conversion

Here's an example of the syntax for converting a model:

tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ-quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

Config File

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...

Priority

When considering your priority:

  • If you need small model files that fit in your GPU's VRAM, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 at batch size 1.
  • Read more: GitHub - casper-hansen/AutoAWQ

Important Note

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

Offline Quantizer Script

We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

vLLM Performance

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.
  • This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.

Suggested labels

null

#456: Baseline benchmark for 17 coding models : r/LocalLLaMA

Similarity score: 0.87

- [ ] [Baseline benchmark for 17 coding models : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19fc4uf/baseline_benchmark_for_17_coding_models/)

Baseline Benchmark for 17 Coding Models

Discussion

I am currently working on implementing some ideas for coding-model inference strategies (prompting, control, context exploration, CoT, ToT, etc.), and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I could test, so I went for every 7B/13B model that has an AWQ quant available, since that is what the inference library I use supports. I thought I'd share some numbers.

Notes:

  • This is a benchmark for getting a local baseline. I'm interested in improvement from here, so the absolute values are less important for me. Don't take the absolute values too seriously. (well, maybe except deepseek-coder-1.3b, that is a bit suspect).
  • I used the HumanEval dataset. This is superseded by HumanEval+ and other more recent benchmarks. I chose this because it was the first one I tried. Again, with my tests I'm looking for improvements over the baseline, so this is mostly fine.
  • AWQ quant is not the best out there, but all my tests will be done with this quant, so for me it is OK.
  • Temp tests were done in only one generation. In general you'd want to average the score over many generations at a given temp.
  • Each model was prompted according to the model card template. Here's an example for the CodeLlama series:
f"""<s>You are a helpful and respectful assistant. Answer the following question: {question}"""

Results

I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots here.

| Model | Temp | Correct / 164 | Percentage |
|---|---|---|---|
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.0 | 67 | 40.85% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.1 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.2 | 68 | 41.46% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.3 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.4 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.5 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.6 | 54 | 32.93% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.7 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.8 | 60 | 36.59% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.9 | 59 | 35.98% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 1.0 | 65 | 39.63% |

Suggested labels

{ "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }

#324: bigcode/tiny_starcoder_py · Hugging Face

Similarity score: 0.87

> **Note:**
>
> [bigcode/tiny_starcoder_py · Hugging Face](https://huggingface.co/bigcode/tiny_starcoder_py)
>
> TinyStarCoderPy
>
> This is a 164M-parameter model with the same architecture as StarCoder (8k context length, MQA & FIM). It was trained on the Python data from StarCoderData for ~6 epochs, which amounts to 100B tokens.
>
> Use
>
> Intended use
>
> The model was trained on GitHub code, to assist with some tasks like Assisted Generation. For pure code completion, we advise using our 15B models StarCoder or StarCoderBase.
>
> Generation
>
> ```python
> # pip install -q transformers
> from transformers import AutoModelForCausalLM, AutoTokenizer
>
> checkpoint = "bigcode/tiny_starcoder_py"
> device = "cuda"  # for GPU usage or "cpu" for CPU usage
>
> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
> model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
>
> inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
> outputs = model.generate(inputs)
> print(tokenizer.decode(outputs[0]))
> ```
>
> Fill-in-the-middle
>
> Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output:
>
> ```python
> input_text = "<fim_prefix>def print_one_two_three():\n    print('one')\n    <fim_suffix>\n    print('three')<fim_middle>"
> inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
> outputs = model.generate(inputs)
> print(tokenizer.decode(outputs[0]))
> ```
>
> Training
>
> Model
>
> - Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective
> - Pretraining steps: 50k
> - Pretraining tokens: 100 billion
> - Precision: bfloat16
>
> Hardware
>
> - GPUs: 32 Tesla A100
> - Training time: 18 hours
>
> Software
>
> - Orchestration: Megatron-LM
> - Neural networks: PyTorch
> - BF16 if applicable: apex
>
> License
>
> The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/bigcode/tiny_starcoder_py/blob/main/LICENSE).
>
> #### Suggested labels
>
> - { "key": "llm-pretraining", "value": "Information related to the pretraining process of Large Language Models" }

#150: Mixture of Experts Explained

Similarity score: 0.86

- [ ] [Mixture of Experts Explained](https://huggingface.co/blog/moe)

TL;DR

MoEs:

- Are pretrained much faster vs. dense models
- Have faster inference compared to a model with the same number of parameters
- Require high VRAM as all experts are loaded in memory
- Face many challenges in fine-tuning, but recent work with MoE instruction-tuning is promising

Let’s dive in!

What is a Mixture of Experts (MoE)?

The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.

Mixture of Experts enable models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.

So, what exactly is a MoE? In the context of transformer models, a MoE consists of two main elements:

- Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!
- A gate network or router, that determines which tokens are sent to which expert. For example, in the figure from the original post, the token “More” is sent to the second expert, and the token “Parameters” is sent to the first network. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pretrained at the same time as the rest of the network.
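To ground those two elements, here is a toy PyTorch sketch of a sparse MoE block with a learned top-2 router. It is illustrative only (arbitrary dimensions, no capacity limits or load-balancing loss), not the Mixtral or Qwen implementation:

```python
# Toy sparse MoE layer: a learned router sends each token to its top-2 expert FFNs.
# Illustrative only; real implementations add capacity limits, balancing losses, etc.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # the "gate network"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        logits = self.router(x)                # (batch, seq, n_experts)
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e        # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 16, 512)               # (batch, seq, d_model)
print(SparseMoE()(tokens).shape)               # torch.Size([2, 16, 512])
```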

#304: GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga

Similarity score: 0.86

- [ ] [GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga](https://www.reddit.com/r/Oobabooga/comments/178yqmg/gptq_vs_exl2_vs_awq_vs_q4_k_m_model_sizes/)

GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes

Mod Post
| Size (MB) | Model |
|---|---|
| 16560 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.000b |
| 17053 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.125b |
| 17463 | Phind-CodeLlama-34B-v2-AWQ-4bit-128g |
| 17480 | Phind-CodeLlama-34B-v2-GPTQ-4bit-128g-actorder |
| 17548 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.250b |
| 18143 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.400b |
| 19133 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.650b |
| 19284 | phind-codellama-34b-v2.Q4_K_M.gguf |
| 19320 | Phind-CodeLlama-34B-v2-AWQ-4bit-32g |
| 19337 | Phind-CodeLlama-34B-v2-GPTQ-4bit-32g-actorder |
I created all these EXL2 quants to compare them to GPTQ and AWQ. The preliminary result is that EXL2 4.400b seems to outperform GPTQ-4bit-32g, and EXL2 4.125b seems to outperform GPTQ-4bit-128g, while using less VRAM in both cases.

I couldn't test AWQ yet because my quantization ended up broken, possibly due to this particular model using NTK scaling, so I'll probably have to go through the fun of burning my GPU for 16 hours again to quantize and evaluate another model so that a conclusion can be reached.

Also no idea if Phind-CodeLlama is actually good. WizardCoder-Python might be better.
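As a rough cross-check on the size table above, file size can be converted into approximate bits per weight. The parameter count below is an approximation for CodeLlama-34B, and overhead from embeddings and any unquantized tensors is ignored, so treat the output as a ballpark only:

```python
# Rough bits-per-weight estimate from the size table above (ballpark only).
PARAMS = 33.7e9  # approximate parameter count of CodeLlama-34B

sizes_mb = {
    "EXL2-4.000b": 16560,
    "EXL2-4.125b": 17053,
    "AWQ-4bit-128g": 17463,
    "GPTQ-4bit-128g-actorder": 17480,
    "Q4_K_M.gguf": 19284,
}

for name, mb in sizes_mb.items():
    bits_per_weight = mb * 1024**2 * 8 / PARAMS  # MB -> bytes -> bits, per parameter
    print(f"{name:>24}: ~{bits_per_weight:.2f} bits/weight")
```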

Suggested labels

"LLM-Quantization"
