About LoRA finetuning of 2:4 sparse and sparse quant models #952
I would like to thank you for a great repo. I have been testing the newly released sparse quant models and was amazed by the speedup in both latency and throughput.

I just have some doubts regarding finetuning of 2:4 sparse models. From what I understood, the model is first sparsified and then fully trained on some data to create the Sparse Llama base model. Since that base model is not instruction tuned, a second finetuning pass is done on instruction data (which is much smaller); this pass still takes as much memory as the first, just less time.
The recipe provided in the examples starts with a dense model and applies sparsification based on calibration data; finetuning is then applied to recover the accuracy lost to sparsification.
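For reference, here is my reading of that flow as a minimal sketch, assuming the `oneshot` entrypoint and `SparseGPTModifier` from the repo's 2:4 examples (I may have argument names slightly off across versions; the model and dataset ids are just placeholders):

```python
# Placeholder ids below; exact oneshot/SparseGPTModifier arguments may
# differ between llm-compressor versions.
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier

# 2:4 keeps 2 nonzero weights in every group of 4, i.e. 50% sparsity
# constrained to the hardware-friendly pattern.
recipe = SparseGPTModifier(sparsity=0.5, mask_structure="2:4")

oneshot(
    model="meta-llama/Llama-3.1-8B",   # dense starting checkpoint (placeholder)
    dataset="open_platypus",           # calibration data (placeholder)
    recipe=recipe,
    output_dir="Llama-3.1-8B-2of4",
    num_calibration_samples=512,
    max_seq_length=2048,
)
# A separate finetuning (or distillation) pass on the saved checkpoint then
# recovers accuracy while keeping the 2:4 mask fixed.
```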
I would like to know whether we can instead start from a sparse base model (like Sparse Llama 3.1 8B) and train a LoRA adapter on a custom dataset. Could the 2:4 sparsity also give a speedup while training the LoRA adapter? Is that possible? This would take far less memory than the full finetuning step that follows sparsification.
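Concretely, I am imagining something like the sketch below with standard Hugging Face PEFT (the checkpoint id is illustrative). Since LoRA freezes the base weights, the 2:4 mask should be preserved by construction, and only the small dense adapter matrices train:

```python
# Sketch only: the checkpoint id is illustrative, and getting an actual 2:4
# speedup in the LoRA forward/backward pass would need sparse kernels.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "neuralmagic/Sparse-Llama-3.1-8B-2of4",  # illustrative sparse base model
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

# model can now go into a normal Trainer/SFT loop on the custom dataset;
# the frozen base weights (and their 2:4 mask) never change.
```

The open question for me is whether the frozen sparse matmuls can actually run on 2:4 kernels during training, rather than as dense GEMMs over weights that happen to contain zeros.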
Does this make sense, assuming vLLM supports serving sparse models together with LoRA adapters?
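On the serving side I am assuming the usual vLLM LoRA flow, again as a sketch; whether the 2:4 kernels and LoRA compose in a given vLLM version is exactly my question, and the checkpoint id and adapter path are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora lets the engine attach adapters at request time.
llm = LLM(
    model="neuralmagic/Sparse-Llama-3.1-8B-2of4",  # illustrative checkpoint id
    enable_lora=True,
)

outputs = llm.generate(
    ["Summarize the benefits of 2:4 sparsity."],
    SamplingParams(temperature=0.0, max_tokens=128),
    # adapter name, integer id, and local path (placeholder path)
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```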
Can all of this also be applied to sparse + w4a16 models, to get QLoRA-style training plus sparsity for both training and inference?
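For the sparse + w4a16 case, I assume the compression side would stack modifiers in one recipe, roughly as in the repo's sparse-quant examples (same placeholder ids as before); the QLoRA-style adapter training on top of that checkpoint is the part I am unsure about:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import GPTQModifier

# First impose the 2:4 mask, then quantize the remaining weights to 4 bits
# (weight-only, 16-bit activations); lm_head stays in full precision.
recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="Llama-3.1-8B-2of4-W4A16",
    num_calibration_samples=512,
)
```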
I would like to contribute if someone can point me in the right direction.
Thanks
Arun
Comments

Hey @arunpatala, your understanding is correct.

LoRA [...]

In terms of what the feature will enable: the key item here will be working on our integration of [...] Sneak preview: we are launching support for 2:4 + fp8 in [...]

Hi, I'd be interested in contributing to the implementation of this feature. Please share the necessary details and pointers to help me get started. I would also appreciate it if you could verify my understanding. Please point me to what is missing in the current implementation and to where I can find the related code. Thanks. I have found the following related links: HFQuantize [...]

Hi @arunpatala: [...]