Qwen-1.5-8x7B : r/LocalLLaMA #647

Open
1 task
irthomasthomas opened this issue Feb 28, 2024 · 1 comment
Labels
- base-model: llm base models not finetuned for chat
- dataset: public datasets and embeddings
- llm: Large Language Models
- llm-experiments: experiments with large language models
- llm-inference-engines: Software to run inference on large language models
- New-Label: Choose this option if the existing labels are insufficient to describe the content accurately
- Research: personal research notes for a topic

Comments

@irthomasthomas (Owner)

TITLE: Qwen-1.5-8x7B : r/LocalLLaMA

DESCRIPTION: "Qwen-1.5-8x7B

New Model
Someone created a sparse MoE Qwen model by merging and finetuning Qwen1.5-7B

Model: Link to Model

Dataset: Link to Dataset

Thread:

I'm excited to release a project I've been working on the last couple of weeks.

Qwen1.5-8x7b: Link to Model

And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: Link to Dataset

The purpose and intention behind this project is better detailed in the model/dataset card, but basically:

I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card.

I then trained Qwen1.5-7b on a 100k subset over 4 epochs.

Took that and made a MoE using @maximelabonne's lazymergekit, utilizing a random gate and no base model.

Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io hit CUDA errors on my machine three times, expending the rest of my budget for the project after only 0.45/4 epochs.

Good news:

Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests.

Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training.

Thank you to @teknium1 , @jon_durbin , @erhartford , Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @mistralai for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family.

Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity.

We're just getting started."

URL: Link to Reddit Post

Suggested labels

{'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}
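The thread describes merging eight copies of Qwen1.5-7B into a sparse MoE with a randomly initialized gate. As a rough illustration of that step, a mergekit-moe ("lazymergekit"-style) recipe might look like the sketch below; the expert count, repository name, output path, and gate settings are assumptions inferred from the post, not the author's actual config, and the exact mergekit schema may differ between versions.

```python
# Hedged sketch: build and run a mergekit-moe config for "8 x Qwen1.5-7B" with a
# random gate. All names and options here are assumptions based on the thread.
import subprocess

experts = "\n".join(
    "  - source_model: Qwen/Qwen1.5-7B" for _ in range(8)  # 8 identical experts -> "8x7B"
)

config = f"""\
base_model: Qwen/Qwen1.5-7B
gate_mode: random   # randomly initialized routers, as described in the post
dtype: bfloat16
experts:
{experts}
"""

with open("qwen1.5-8x7b.yaml", "w") as f:
    f.write(config)

# Requires `pip install mergekit`; writes the merged MoE to ./Qwen1.5-8x7b
subprocess.run(["mergekit-moe", "qwen1.5-8x7b.yaml", "./Qwen1.5-8x7b"], check=True)
```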

@irthomasthomas added the base-model, dataset, llm, llm-experiments, llm-inference-engines, New-Label, and Research labels on Feb 28, 2024
@irthomasthomas (Owner, Author)

Related issues

#389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

Similarity score: 0.88

- [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax:

python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ-quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
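As a hedged illustration of how that config might be used (flag names follow typical OpenNMT-py usage and the file names are placeholders; verify against your installed version), inference could be launched along these lines:

```python
# Hypothetical invocation of OpenNMT-py inference with the config above saved as
# inference.yaml. Input/output file names are placeholders for illustration.
import subprocess

subprocess.run(
    [
        "python", "translate.py",
        "-config", "inference.yaml",  # the YAML config shown above
        "-src", "prompts.txt",        # one prompt per line
        "-output", "outputs.txt",     # where generations are written
    ],
    check=True,
)
```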

When considering your priority:

  • If you need small model files that fit in your GPU's VRAM, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 at batch size 1.

Please read more here: GitHub - casper-hansen/AutoAWQ

Important Note:

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.
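To make the GEMM/GEMV choice concrete, here is a minimal quantization sketch following the AutoAWQ README; the model and output paths are placeholders, and the API may differ between AutoAWQ releases:

```python
# Minimal AutoAWQ quantization sketch (paths and settings are placeholders).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"   # source FP16 model (example)
quant_path = "mistral-7b-instruct-v0.2-awq"         # output directory for the quant

# "version" selects the kernel flavor discussed above: "GEMM" or "GEMV".
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # calibrate and quantize weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```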

Offline Quantizer Script:

  • We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

Enjoy!


Comparison: Fast Inference with vLLM

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.

This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.
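As a toy illustration of that adjustment (every number below is invented, not a measurement from the post), the generation-only rate can be backed out by subtracting the prefill time from the total wall-clock time:

```python
# Toy prefill adjustment; all numbers are invented for illustration.
new_tokens = 256        # tokens generated after the prompt
total_time_s = 3.2      # wall-clock time including step0 (prompt prefill)
prefill_time_s = 0.4    # time spent on step0 alone

raw_tps = new_tokens / total_time_s                        # naive throughput
decode_tps = new_tokens / (total_time_s - prefill_time_s)  # prefill-adjusted throughput
print(f"raw: {raw_tps:.1f} tok/s, prefill-adjusted: {decode_tps:.1f} tok/s")
```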

Suggested labels

{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

#431: awq llama quantization

Similarity score: 0.88

- [ ] [awq llama quantization](huggingface.co)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.

Model Conversion

Here's an example of the syntax for converting a model:

tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
  • TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
  • output: Specifies the target directory and model name you want to save.
  • format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ-quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

Config File

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...

Priority

When considering your priority:

  • If you need small model files that fit in your GPU's VRAM, try AWQ, but it will be slow for large batch sizes.
  • AWQ models are faster than FP16 at batch size 1.
  • Read more: GitHub - casper-hansen/AutoAWQ

Important Note

  • There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

Offline Quantizer Script

We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

vLLM Performance

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.
  • This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.

Suggested labels

null

#456: Baseline benchmark for 17 coding models : r/LocalLLaMA

Similarity score: 0.87

- [ ] [Baseline benchmark for 17 coding models : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19fc4uf/baseline_benchmark_for_17_coding_models/)

Baseline Benchmark for 17 Coding Models

Discussion

I am currently working on implementing some ideas for coding-model inference strategies (prompting, control, context exploration, CoT, ToT, etc.), and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I could test, so I went for every 7B/13B model that has an AWQ quant available, since that is what the inference library I use supports. I thought I'd share some numbers.

Notes:

  • This is a benchmark for getting a local baseline. I'm interested in improvement from here, so the absolute values are less important for me. Don't take the absolute values too seriously. (well, maybe except deepseek-coder-1.3b, that is a bit suspect).
  • I used the HumanEval dataset. This is superseded by HumanEval+ and other more recent benchmarks. I chose this because it was the first one I tried. Again, with my tests I'm looking for improvements over the baseline, so this is mostly fine.
  • AWQ quant is not the best out there, but all my tests will be done with this quant, so for me it is OK.
  • Temp tests were done in only one generation. In general you'd want to average the score over many generations at a given temp.
  • Each model was prompted according to the model card template. Here's an example for the CodeLlama series:
f"""<s>You are a helpful and respectful assistant. Answer the following question: {question}"""

Results

I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots here.

| Model | Temp | Correct / 164 | Percentage |
|---|---|---|---|
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.0 | 67 | 40.85% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.1 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.2 | 68 | 41.46% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.3 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.4 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.5 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.6 | 54 | 32.93% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.7 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.8 | 60 | 36.59% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.9 | 59 | 35.98% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 1.0 | 65 | 39.63% |

Suggested labels

{ "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }

#324: bigcode/tiny_starcoder_py · Hugging Face

Similarity score: 0.87

> **Note:**
>
> [bigcode/tiny_starcoder_py · Hugging Face](https://huggingface.co/bigcode/tiny_starcoder_py)
>
> TinyStarCoderPy
>
> This is a 164M-parameter model with the same architecture as StarCoder (8k context length, MQA & FIM). It was trained on the Python data from StarCoderData for ~6 epochs, which amounts to 100B tokens.
>
> Use
>
> Intended use
>
> The model was trained on GitHub code, to assist with some tasks like Assisted Generation. For pure code completion, we advise using our 15B models StarCoder or StarCoderBase.
>
> Generation
>
> ```python
> # pip install -q transformers
> from transformers import AutoModelForCausalLM, AutoTokenizer
>
> checkpoint = "bigcode/tiny_starcoder_py"
> device = "cuda"  # for GPU usage or "cpu" for CPU usage
>
> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
> model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
>
> inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
> outputs = model.generate(inputs)
> print(tokenizer.decode(outputs[0]))
> ```
>
> Fill-in-the-middle
>
> Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output:
>
> ```python
> input_text = "<fim_prefix>def print_one_two_three():\n    print('one')\n    <fim_suffix>\n    print('three')<fim_middle>"
> inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
> outputs = model.generate(inputs)
> print(tokenizer.decode(outputs[0]))
> ```
>
> Training
>
> Model
>
> - Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective
> - Pretraining steps: 50k
> - Pretraining tokens: 100 billion
> - Precision: bfloat16
>
> Hardware
>
> - GPUs: 32 Tesla A100
> - Training time: 18 hours
>
> Software
>
> - Orchestration: Megatron-LM
> - Neural networks: PyTorch
> - BF16 if applicable: apex
>
> License
>
> The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/bigcode/tiny_starcoder_py/blob/main/LICENSE).
>
> #### Suggested labels
>
> - { "key": "llm-pretraining", "value": "Information related to the pretraining process of Large Language Models" }

#150: Mixture of Experts Explained

Similarity score: 0.86

- [ ] [Mixture of Experts Explained](https://huggingface.co/blog/moe)

TL;DR

MoEs:

- Are pretrained much faster vs. dense models
- Have faster inference compared to a model with the same number of parameters
- Require high VRAM as all experts are loaded in memory
- Face many challenges in fine-tuning, but recent work with MoE instruction-tuning is promising

Let’s dive in!

What is a Mixture of Experts (MoE)?

The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.

Mixture of Experts enable models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.

So, what exactly is a MoE? In the context of transformer models, a MoE consists of two main elements:

- Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!
- A gate network or router, that determines which tokens are sent to which expert. For example, in the figure from the original post, the token “More” is sent to the second expert, and the token “Parameters” is sent to the first network. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pretrained at the same time as the rest of the network.
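To ground those two elements, here is a toy PyTorch sketch of a sparse MoE block with a learned top-2 router. It is illustrative only (arbitrary dimensions, no capacity limits or load-balancing loss), not the Mixtral or Qwen implementation:

```python
# Toy sparse MoE layer: a learned router sends each token to its top-2 expert FFNs.
# Illustrative only; real implementations add capacity limits, balancing losses, etc.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # the "gate network"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        logits = self.router(x)                # (batch, seq, n_experts)
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e        # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 16, 512)               # (batch, seq, d_model)
print(SparseMoE()(tokens).shape)               # torch.Size([2, 16, 512])
```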

#304: GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga

Similarity score: 0.86

- [ ] [GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga](https://www.reddit.com/r/Oobabooga/comments/178yqmg/gptq_vs_exl2_vs_awq_vs_q4_k_m_model_sizes/)

GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes

Mod Post
| Size (MB) | Model |
|---|---|
| 16560 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.000b |
| 17053 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.125b |
| 17463 | Phind-CodeLlama-34B-v2-AWQ-4bit-128g |
| 17480 | Phind-CodeLlama-34B-v2-GPTQ-4bit-128g-actorder |
| 17548 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.250b |
| 18143 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.400b |
| 19133 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.650b |
| 19284 | phind-codellama-34b-v2.Q4_K_M.gguf |
| 19320 | Phind-CodeLlama-34B-v2-AWQ-4bit-32g |
| 19337 | Phind-CodeLlama-34B-v2-GPTQ-4bit-32g-actorder |
I created all these EXL2 quants to compare them to GPTQ and AWQ. The preliminary result is that EXL2 4.400b seems to outperform GPTQ-4bit-32g, and EXL2 4.125b seems to outperform GPTQ-4bit-128g, while using less VRAM in both cases.

I couldn't test AWQ yet because my quantization ended up broken, possibly due to this particular model using NTK scaling, so I'll probably have to go through the fun of burning my GPU for 16 hours again to quantize and evaluate another model so that a conclusion can be reached.

Also no idea if Phind-CodeLlama is actually good. WizardCoder-Python might be better.
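As a rough cross-check on the size table above, file size can be converted into approximate bits per weight. The parameter count below is an approximation for CodeLlama-34B, and overhead from embeddings and any unquantized tensors is ignored, so treat the output as a ballpark only:

```python
# Rough bits-per-weight estimate from the size table above (ballpark only).
PARAMS = 33.7e9  # approximate parameter count of CodeLlama-34B

sizes_mb = {
    "EXL2-4.000b": 16560,
    "EXL2-4.125b": 17053,
    "AWQ-4bit-128g": 17463,
    "GPTQ-4bit-128g-actorder": 17480,
    "Q4_K_M.gguf": 19284,
}

for name, mb in sizes_mb.items():
    bits_per_weight = mb * 1024**2 * 8 / PARAMS  # MB -> bytes -> bits, per parameter
    print(f"{name:>24}: ~{bits_per_weight:.2f} bits/weight")
```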

Suggested labels

"LLM-Quantization"
