
About LoRA finetuning of 2:4 sparse and sparse-quant models #952

Open
arunpatala opened this issue Dec 4, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@arunpatala

I would like to thank you for a great repo.

I have been testing the newly released sparse-quant models and was amazed by the speedup in both latency and throughput.
I just have some doubts regarding finetuning of 2:4 sparse models.

From what I understand, the model is first sparsified and then fully trained on some data to create the sparse Llama base model.
As this is not instruction tuned, we do another round of finetuning on instruction data (which is much smaller). This still takes as much memory, but less time.

The recipe provided in the examples starts with a dense model and does sparsification based on calibration data. Fine-tuning is then applied to regain accuracy.
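Roughly, I understand the example recipe to look something like this (the modifier and argument names below are my reading of the examples, so treat them as assumptions rather than a verified script):

```python
# Rough sketch (my understanding) of the example recipe: one-shot 2:4
# sparsification of a dense model on calibration data, to be followed by
# finetuning to recover accuracy. Names/arguments are assumptions based on
# the llm-compressor examples, not a verified script.
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.transformers import oneshot

recipe = SparseGPTModifier(
    sparsity=0.5,            # 50% of weights pruned, in a 2:4 pattern
    mask_structure="2:4",
    targets=["Linear"],
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Llama-3.1-8B",   # dense starting point (placeholder)
    dataset="open_platypus",           # calibration data (placeholder)
    recipe=recipe,
    output_dir="Llama-3.1-8B-2of4-sparse",
    num_calibration_samples=512,
    max_seq_length=2048,
)
```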

I would like to know if we can start with a sparse base model (like Sparse Llama 3.1 8B) and train a LoRA adapter on a custom dataset. Could there also be a sparsity speedup when training the LoRA adapters? This would take much less memory than the full finetuning step after sparsification.

Does this make sense, assuming vLLM supports serving sparse models with LoRA?
Can all of this also be applied to sparse + w4a16 models to get QLoRA + sparsity training and inference?

I would like to contribute if someone can point me in the right direction.

Thanks
Arun

@arunpatala arunpatala added the enhancement New feature or request label Dec 4, 2024
@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Dec 6, 2024

Hey @arunpatala

Your understanding is correct

LoRA

  • Training LoRA adapters on the 2:4 sparse base model is 100% something that we want to support and intend to support (and we have designed the system in such a way that we could support it); however, we do not currently have the bandwidth to iron out all the user stories and examples in the short term. We would 100% welcome an implementation if this feature is something you are interested in contributing.

In terms of what the feature will enable:

  • You should be able to see compression during LoRA fine-tuning from 2:4 sparsity + quantization. However, since we have not added CUDA kernels for accelerating 2:4 sparsity to compressed-tensors (they exist only in vLLM), you will not see kernel-level speedup. The weight compression will, however, increase the amount of batching you can do, which may help end-to-end speed.
  • This feature is still valuable: if you train a LoRA adapter on top of the 2:4 sparse model and deploy the model with the unmerged adapter, you will get the benefit of faster deployment in vLLM (a minimal serving sketch follows this list). This is a great user story!
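
The deployment side would look something like this (model and adapter paths are placeholders, not released artifacts):

```python
# Minimal sketch: serve a 2:4 sparse base model in vLLM with an unmerged LoRA
# adapter. Model/adapter paths are placeholders; assumes vLLM's LoRA support
# composes with the sparse checkpoint as described above.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="path/or/hub-id-of-2of4-sparse-base",  # placeholder
    enable_lora=True,
)

outputs = llm.generate(
    ["Summarize the benefits of 2:4 sparsity."],
    SamplingParams(max_tokens=128),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora_adapter"),  # placeholder
)
print(outputs[0].outputs[0].text)
```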

The key item here will be working on our integration of compressed-tensors with HFQuantizer (https://huggingface.co/docs/transformers/quantization/compressed_tensors) and making sure it is compatible with LoRA training with HF peft. If you're interested in taking this on, we can connect over Slack or live to discuss scoping!
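
Roughly, the user story we want to enable looks like this (a sketch of the intended workflow, not something guaranteed to work end-to-end today; the checkpoint id is a placeholder):

```python
# Sketch of the intended user story: load a compressed-tensors checkpoint via
# transformers (HFQuantizer) and attach a LoRA adapter with peft. Illustrative
# only; the checkpoint id is a placeholder and this path is not fully wired up yet.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/or/hub-id-of-2of4-sparse-base"  # placeholder
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then train with the HF Trainer / TRL on an instruction dataset, keeping the adapter unmerged.
```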

Sneak preview:

We are launching support for 2:4 + fp8 in vLLM next week, so this feature could be very valuable for deploying on H100s.

Performance snapshot :)

@arunpatala
Author

Hi,

I’d be interested in contributing to the implementation of this feature. Please share the necessary details and pointers to help me get started. I would also appreciate it if you could verify my understanding:

  1. Current Integration with Compressed-Tensors:

    • As I understand it, the base sparse model is not yet using compressed-tensors. To make it compatible with Hugging Face (HF), we need to integrate with HFQuantizer to load and save sparse models in a compressed format.
    • Sparse + GPTQ models are already using compressed-tensors. Are these tensors loaded into GPU memory in a compressed format, or are they decompressed before being loaded?
    • Is this sufficient to get memory savings and increase batch size?
  2. Inference and Training Acceleration:

    • When you mention that the acceleration for 2:4 sparsity in inference is not yet added to compressed-tensors, does this also apply to training (e.g., QLoRA fine-tuning)?
    • Does llm-compressor already have the necessary kernels?
  3. LoRA Fine-Tuning with Sparsity:

    • If we use a base sparse model, QLoRA fine-tuning should be straightforward, though without speed benefits.
    • However, merging the LoRA adapter currently might result in the loss of sparsity. One potential solution is to mask the LoRA weights with the sparse weight mask during training, for example:
      output = SparseLinear(input) + mask(LoRA(input))
    • This approach could enable LoRA fine-tuning of sparse models without sacrificing sparsity (a rough sketch follows this list).
    • Even if there is no speedup during LoRA training of sparse models, the merged model would retain sparsity, leading to faster inference when tuned on custom datasets.
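
To make the idea concrete, here is a rough PyTorch sketch of one way to implement it: mask the low-rank update B·A with the frozen 2:4 mask of the base weight, so the merged weight stays 2:4 sparse. All names here are made up for illustration; this is not an existing peft or compressed-tensors module:

```python
# Rough sketch (hypothetical, illustrative names): a LoRA wrapper that masks
# the low-rank update (B @ A) with the frozen 2:4 sparsity mask of the base
# weight, so W + mask * (B @ A) stays 2:4 sparse after merging.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Frozen sparsity mask derived from the (already 2:4 sparse) base weight.
        self.register_buffer("mask", (base_linear.weight != 0).to(base_linear.weight.dtype))
        out_features, in_features = base_linear.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base (sparse) linear plus the masked low-rank update.
        delta_w = self.mask * (self.lora_B @ self.lora_A) * self.scaling
        return self.base(x) + F.linear(x, delta_w)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # Merged weight retains the 2:4 pattern because the update shares the base mask.
        self.base.weight += self.mask * (self.lora_B @ self.lora_A) * self.scaling
        return self.base
```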

Please point me to what is missing in the current implementation, and where I can find the related code.

Thanks
Arun

I have found the following related links:

HFQuantizer
compressed_tensors
Quantization
compressed-tensors
marlin_24

@dsikka
Collaborator

dsikka commented Dec 10, 2024

Hi @arunpatala:

  1. Our sparse models are supported in compressed-tensors and we are currently in the process of enabling loading them through HFQuantizer via this PR: Run model as compressed/uncompressed mode huggingface/transformers#34719 (comment)
  2. Compressed models are loaded in their compressed format and then each layer is decompressed before its forward pass. This is the case when run_compressed is set to True. When it is False, we decompress the entire model after loading.
    The general lifecycle of how quantized parameters are updated can be seen here: https://github.com/neuralmagic/compressed-tensors/blob/2dcbc9d1dd3f4dc29c280efab481b9f0cfde0a27/src/compressed_tensors/quantization/lifecycle/apply.py#L105
  3. Generally speaking, neither llm-compressor nor compressed-tensors has CUDA kernels for acceleration; these exist only in vLLM. The focus of decompression in compressed-tensors is primarily accuracy testing.
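
For context, loading in the two modes would look roughly like this (assuming the run_compressed flag on CompressedTensorsConfig from the linked PR; the exact API may change before/after it lands, and the checkpoint id is a placeholder):

```python
# Rough sketch of the two loading modes described above. Assumes the
# run_compressed flag on CompressedTensorsConfig from the linked transformers
# PR; the exact API may differ, and the checkpoint id is a placeholder.
from transformers import AutoModelForCausalLM
from transformers.utils.quantization_config import CompressedTensorsConfig

model_id = "path/or/hub-id-of-compressed-checkpoint"  # placeholder

# Keep weights compressed in memory; each layer decompresses in its forward pass.
compressed = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=CompressedTensorsConfig(run_compressed=True),
)

# Decompress the entire model after loading (e.g. for accuracy testing or fine-tuning).
decompressed = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=CompressedTensorsConfig(run_compressed=False),
)
```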
