
Backpropagation for GEMM #220

Closed · wants to merge 8 commits
Conversation

@s4rduk4r (Contributor) commented Nov 21, 2023

This PR implements backpropagation for the GEMM version. It adds:

  1. A CUDA kernel for GEMM backpropagation (a modification of the forward-pass kernel), named gemm_backward_cuda
  2. AWQ LoRA objects
  3. A runtime patch of PEFT to enable LoRA finetuning
  4. An example finetuning script, awq_autograd.py

Below is the result of a LoRA finetune on 200 entries from the OpenAssistant/oasst1 dataset; a rough sketch of how the pieces above fit together follows the screenshot.
[screenshot: training loss from the LoRA finetune]
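For orientation, here is a rough outline of how these pieces are meant to fit together. apply_awq_peft_patch is a placeholder for this PR's runtime PEFT patch, and the checkpoint path, LoRA hyperparameters, and from_quantized arguments are illustrative assumptions rather than the exact contents of awq_autograd.py.

```python
# Illustrative outline only -- names marked as placeholders are not the exact API of this PR.
from awq import AutoAWQForCausalLM
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, Trainer, TrainingArguments

quant_path = "TheBloke/Llama-2-7B-AWQ"                  # assumed example checkpoint
model = AutoAWQForCausalLM.from_quantized(quant_path)   # argument details may differ by version
tokenizer = AutoTokenizer.from_pretrained(quant_path)

apply_awq_peft_patch()  # placeholder: the runtime patch that teaches PEFT about WQLinear_GEMM

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
peft_model = get_peft_model(model.model, lora_config)   # .model is the underlying HF model

dataset = load_dataset("OpenAssistant/oasst1", split="train[:200]")  # 200 entries, as in the screenshot

trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(output_dir="awq-lora-out", per_device_train_batch_size=1,
                           num_train_epochs=1, fp16=True),
    train_dataset=dataset,  # tokenization/collation omitted for brevity
)
trainer.train()
```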

@casper-hansen (Owner)

Wow, this looks super interesting!

  • Have you profiled how fast this is compared to GPTQ/BitsAndBytes?
  • What does the memory usage look like compared to GPTQ/BitsAndBytes?

@s4rduk4r (Contributor, Author)

> Wow, this looks super interesting!
>
>   • Have you profiled how fast this is compared to GPTQ/BitsAndBytes?
>   • What does the memory usage look like compared to GPTQ/BitsAndBytes?

Thank you. I haven't profiled performance or memory usage against GPTQ or BitsAndBytes yet. I'm not sure how to measure memory consumption, but I can compare training times against GPTQ. I should be able to measure them this weekend.
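For the memory side, a single training step can be profiled with plain PyTorch utilities; here is a minimal sketch (model and batch are placeholders for whichever backend and data pipeline are being compared):

```python
import time
import torch

def measure_train_step(model, batch):
    """Time one forward/backward step and report peak GPU memory in MB."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()

    loss = model(**batch).loss   # assumes the batch already contains labels
    loss.backward()

    torch.cuda.synchronize()
    step_time_s = time.perf_counter() - start
    peak_mem_mb = torch.cuda.max_memory_allocated() / 1024**2
    return step_time_s, peak_mem_mb
```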

@casper-hansen (Owner)

I looked into your kernel. It seems it is not computing any gradients, yet you call it backward. Additionally, I see that the loss in your screenshot is increasing. Could it be that you have not implemented the backward pass yet?

@s4rduk4r (Contributor, Author)

> I looked into your kernel. It seems it is not computing any gradients, yet you call it backward. Additionally, I see that the loss in your screenshot is increasing. Could it be that you have not implemented the backward pass yet?

You're right. This kernel doesn't compute gradients, because I rely on non-fused layers without the @torch.no_grad decorator, so the gradients are calculated by PyTorch and fed into the WQLinear_GEMM_Propagator.backward() method, which the trainer calls during the backward pass.
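In other words, the wiring is roughly a torch.autograd.Function whose forward runs the existing quantized GEMM kernel and whose backward only multiplies the incoming gradient by the transposed weight matrix. The sketch below illustrates that idea; it is not the PR's exact WQLinear_GEMM_Propagator code, the gemm_forward_cuda argument order is an assumption, and dequantize() is a hypothetical helper.

```python
import torch
import awq_ext  # AutoAWQ CUDA extension; the binding signature used below is an assumption


class QuantLinearFn(torch.autograd.Function):
    """Illustrative stand-in for the propagator described above."""

    @staticmethod
    def forward(ctx, x, qweight, qzeros, scales):
        ctx.save_for_backward(qweight, qzeros, scales)
        # Forward pass through the existing quantized GEMM kernel.
        return awq_ext.gemm_forward_cuda(x, qweight, scales, qzeros, 8)

    @staticmethod
    def backward(ctx, grad_output):
        qweight, qzeros, scales = ctx.saved_tensors
        # The quantized weights are frozen, so only the gradient w.r.t. the
        # activations is needed: grad_x = grad_y @ W^T.
        w = dequantize(qweight, qzeros, scales)  # hypothetical helper, W has shape (in, out)
        grad_input = grad_output @ w.t()
        return grad_input, None, None, None


# Usage: out = QuantLinearFn.apply(x, qweight, qzeros, scales)
```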

@casper-hansen (Owner) commented Nov 22, 2023

> I looked into your kernel. It seems it is not computing any gradients, yet you call it backward. Additionally, I see that the loss in your screenshot is increasing. Could it be that you have not implemented the backward pass yet?
>
> You're right. This kernel doesn't compute gradients, because I rely on non-fused layers without the @torch.no_grad decorator, so the gradients are calculated by PyTorch and fed into the WQLinear_GEMM_Propagator.backward() method, which the trainer calls during the backward pass.

Right, so the backward is just doing matrix multiplication. Do we need two new kernels for that, or could we reuse the existing ones?

For reference, there was another attempt at a backward pass. This one runs the matrix multiplications in FP16, though.

https://github.com/compressa-ai/AutoAWQ/blob/dev/awq/modules/linear.py#L60-L88

@s4rduk4r (Contributor, Author)

> I looked into your kernel. It seems it is not computing any gradients, yet you call it backward. Additionally, I see that the loss in your screenshot is increasing. Could it be that you have not implemented the backward pass yet?
>
> You're right. This kernel doesn't compute gradients, because I rely on non-fused layers without the @torch.no_grad decorator, so the gradients are calculated by PyTorch and fed into the WQLinear_GEMM_Propagator.backward() method, which the trainer calls during the backward pass.
>
> Right, so the backward is just doing matrix multiplication. Do we need two new kernels for that, or could we reuse the existing ones?
>
> For reference, there was another attempt at a backward pass. This one runs the matrix multiplications in FP16, though.
>
> https://github.com/compressa-ai/AutoAWQ/blob/dev/awq/modules/linear.py#L60-L88

I think we can add a matrix-transpose flag (default false) to the forward-pass kernels, which gemm_backward_cuda would then call with the flag set to true. But if someone wants to train fused layers in the future, I think two dedicated backward-pass kernels will be needed.
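A hypothetical sketch of what the flag would mean at the Python call site; the transpose argument does not exist in the current binding, and dequantize() only spells out the math the flag is supposed to cover inside the kernel.

```python
import awq_ext  # existing AutoAWQ extension; only the non-transposed path below is real today


def quantized_gemm(x, qweight, scales, qzeros, split_k_iters=8, transpose=False):
    """Sketch of the proposed flag; transpose=True is what gemm_backward_cuda would request."""
    if transpose:
        # Backward pass: grad_x = grad_y @ W^T. The proposal is to fold this transposed
        # read into the existing kernel behind the flag instead of shipping a second kernel.
        w = dequantize(qweight, qzeros, scales)  # hypothetical helper standing in for the kernel
        return x @ w.t()
    # Forward pass: y = x @ W, with W dequantized on the fly inside the kernel.
    return awq_ext.gemm_forward_cuda(x, qweight, scales, qzeros, split_k_iters)
```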

@casper-hansen (Owner)

> I looked into your kernel. It seems it is not computing any gradients, yet you call it backward. Additionally, I see that the loss in your screenshot is increasing. Could it be that you have not implemented the backward pass yet?
>
> You're right. This kernel doesn't compute gradients, because I rely on non-fused layers without the @torch.no_grad decorator, so the gradients are calculated by PyTorch and fed into the WQLinear_GEMM_Propagator.backward() method, which the trainer calls during the backward pass.
>
> Right, so the backward is just doing matrix multiplication. Do we need two new kernels for that, or could we reuse the existing ones?
> For reference, there was another attempt at a backward pass. This one runs the matrix multiplications in FP16, though.
> https://github.com/compressa-ai/AutoAWQ/blob/dev/awq/modules/linear.py#L60-L88
>
> I think we can add a matrix-transpose flag (default false) to the forward-pass kernels, which gemm_backward_cuda would then call with the flag set to true. But if someone wants to train fused layers in the future, I think two dedicated backward-pass kernels will be needed.

I would expect users to make use of axolotl, which has its own way of fusing/patching layers. I implemented fused layers for training before, but there is not much benefit when we are compute-bound during training.

@s4rduk4r (Contributor, Author) commented Dec 3, 2023

> I looked into your kernel. It seems it is not computing any gradients, yet you call it backward. Additionally, I see that the loss in your screenshot is increasing. Could it be that you have not implemented the backward pass yet?
>
> You're right. This kernel doesn't compute gradients, because I rely on non-fused layers without the @torch.no_grad decorator, so the gradients are calculated by PyTorch and fed into the WQLinear_GEMM_Propagator.backward() method, which the trainer calls during the backward pass.
>
> Right, so the backward is just doing matrix multiplication. Do we need two new kernels for that, or could we reuse the existing ones?
>
> For reference, there was another attempt at a backward pass. This one runs the matrix multiplications in FP16, though.
>
> https://github.com/compressa-ai/AutoAWQ/blob/dev/awq/modules/linear.py#L60-L88

The solution in the link works, but I constantly fail to reproduce it in CUDA code. Probably I'm not that good at this kind of task. So I think it would be better to either incorporate the linked solution as it is or just close this PR, because I won't be able to bring it into good shape :(

@casper-hansen (Owner)

What I wonder most about is loss stability. Can you train a model that works?

In your image here, you only trained for a few steps, but the model did not improve.

[screenshot: training loss over a few steps]

@s4rduk4r (Contributor, Author) commented Dec 4, 2023

> What I wonder most about is loss stability. Can you train a model that works?
>
> In your image here, you only trained for a few steps, but the model did not improve.
>
> [screenshot: training loss over a few steps]

You're right. The loss just waltzes around 2.6 and never decreases over a long period of time. But the solution by @compressa-ai does train the model. Here's a fresh screenshot from one epoch of training with @compressa-ai's solution; the loss can still increase slightly along the way, but overall it decreases toward zero. I was also able to produce a LoRA module with it.
[screenshot: awq_training_dequantize_fp16 — loss curve over 1 epoch]

I think this PR should be closed so that @compressa-ai's solution can be submitted as its own PR.

@casper-hansen (Owner)

If you could add his dequantize-weights code and remove the backward kernel, then we could get this PR merged. We might as well merge it here, now that you have done the other work with the training code, etc.
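For reference, the dequantize-based path boils down to something like the sketch below: unpack the 4-bit weights to FP16 and let ordinary matmuls carry both directions. This is an illustration of the approach rather than the code from the linked branch; dequantize() stands in for the dequantization kernel added to this PR.

```python
import torch


class DequantizedLinearFn(torch.autograd.Function):
    """Dequantize to FP16 and use plain matmuls for forward and backward."""

    @staticmethod
    def forward(ctx, x, qweight, qzeros, scales, bias):
        w = dequantize(qweight, qzeros, scales)  # placeholder for the dequant kernel, W is (in, out)
        ctx.save_for_backward(w)                 # keep the FP16 copy around for the backward matmul
        out = x @ w
        if bias is not None:
            out = out + bias
        return out

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # The quantized weights stay frozen; only the activation gradient is returned.
        grad_input = grad_output @ w.t()
        return grad_input, None, None, None, None


# Usage: out = DequantizedLinearFn.apply(x, qweight, qzeros, scales, None)
```

Saving the FP16 copy avoids dequantizing a second time in backward at the cost of extra memory; re-running dequantize() in backward would trade that memory back for compute.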

@s4rduk4r (Contributor, Author) commented Dec 6, 2023

@casper-hansen I removed the backward kernel and added a dequantization kernel. I also noticed that the runtime patch for PEFT doesn't work with recent versions (0.6.x), but it does work with peft 0.3.0. For now I can't find a way to patch it at runtime. Maybe a PR should be made to PEFT to include AutoAWQQuantLinear, like what happened with AutoGPTQQuantLinear.

@casper-hansen (Owner)

It looks great. I think we need a PEFT integration to make this fully work. They just need to import and replace modules similarly to how they are already doing it for GPTQ.
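As a rough picture of what such an integration has to do (independent of PEFT's internal dispatch machinery, which is not reproduced here): walk the model, find the frozen WQLinear_GEMM modules, and put a trainable low-rank branch alongside them. The wrapper below is a hand-rolled illustration, not PEFT's actual LoRA class.

```python
import torch
import torch.nn as nn
from awq.modules.linear import WQLinear_GEMM


class LoraWrappedAWQLinear(nn.Module):
    """Keep the frozen quantized linear and add a trainable low-rank branch."""

    def __init__(self, base: WQLinear_GEMM, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base  # frozen AWQ layer (4-bit packed weights)
        self.lora_A = nn.Linear(base.in_features, r, bias=False, dtype=torch.float16)
        self.lora_B = nn.Linear(r, base.out_features, bias=False, dtype=torch.float16)
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling


def wrap_awq_linears(model: nn.Module, target_names=("q_proj", "v_proj")):
    """Replace selected WQLinear_GEMM children with the LoRA-wrapped version."""
    # Collect first, then replace, so the module tree is not mutated while iterating.
    to_wrap = [
        (parent, name, child)
        for parent in model.modules()
        for name, child in parent.named_children()
        if isinstance(child, WQLinear_GEMM) and name in target_names
    ]
    for parent, name, child in to_wrap:
        setattr(parent, name, LoraWrappedAWQLinear(child))
    return model
```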

@casper-hansen (Owner)

I'm looking to release this in approximately v0.2.0 of AutoAWQ. Currently, I am first trying to add support for Mixtral before merging this and other PRs.

vLLM is working on a new Triton kernel that scales even better than the original GEMM kernel. Perhaps we can create a backward pass once the vLLM one is done.

https://github.com/vllm-project/vllm/blob/qmm/vllm/model_executor/layers/quantization/ops/awq.py

@s4rduk4r (Contributor, Author)

The Triton kernel sounds like a great idea. As for the PEFT integration, I haven't had time to look into PEFT yet. Maybe I'll be able to this weekend.

@casper-hansen (Owner)

@s4rduk4r did you forget to push your latest training script? The current one gives me a loss of zero after some modifications to use Mistral.

[screenshot: training loss stuck at zero]

@s4rduk4r (Contributor, Author)

> @s4rduk4r did you forget to push your latest training script? The current one gives me a loss of zero after some modifications to use Mistral.
>
> [screenshot: training loss stuck at zero]

Sorry for not replying sooner. The training script hasn't been updated. Truth be told, I only tested it against a Llama 2 model, and it worked at the time. Maybe something has to be done differently there. Unfortunately, I have almost no time right now and won't be able to dig into this issue any deeper. I'm sorry.
