
[Kernel][RFC] Initial commit containing new Triton kernels for multi lora serving. #5356

Closed
Wants to merge 17 commits.

Conversation

FurtherAI
Contributor

SGMV Triton Kernels

Should Fix #2829, #4007, #4053, #4063, #3793, #4708

Modification of #5025, changing to a contiguous weight format rather than S-LoRA paged weights.
In parallel with #5036


New Triton kernels for multi-LoRA computation. These (should) handle any shape and data type, compute at the actual LoRA rank, and also speed up grouped LoRA requests (especially prefill); see the sketch after the list below.

  • Replace Punica kernels.
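For illustration, here is a minimal PyTorch reference of the computation these SGMV (segmented gather matrix-vector) kernels implement. The tensor names, shapes, and the pure-Python loop are assumptions for clarity, not the kernels' actual code:

```python
import torch

def sgmv_reference(x, lora_a, lora_b, seg_starts, seg_ends, lora_ids, ranks):
    """Pure-PyTorch reference for segmented LoRA (SGMV-style) computation.

    x:          [num_tokens, hidden_dim] activations, tokens grouped by LoRA
    lora_a:     [max_loras, hidden_dim, max_rank] stacked A weights, padded to max_rank
    lora_b:     [max_loras, max_rank, out_dim]    stacked B weights, padded to max_rank
    seg_starts, seg_ends: token boundaries of each segment (one LoRA per segment)
    lora_ids:   LoRA index used by each segment
    ranks:      actual rank of each LoRA, so the padding is never touched
    """
    out = torch.zeros(x.shape[0], lora_b.shape[-1], dtype=x.dtype, device=x.device)
    for start, end, idx in zip(seg_starts, seg_ends, lora_ids):
        r = ranks[idx]
        # Compute at the LoRA's true rank rather than the padded max_rank.
        a = lora_a[idx, :, :r]   # [hidden_dim, r]
        b = lora_b[idx, :r, :]   # [r, out_dim]
        out[start:end] = (x[start:end] @ a) @ b
    return out
```

Grouping all tokens that share a LoRA into one segment is what helps prefill (many tokens, few adapters): each adapter's weights are loaded once per segment rather than once per token.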

ping @Yard1

FurtherAI added 5 commits May 24, 2024 03:33
@robertgshaw2-redhat
Collaborator

@tlrmchlsmth

Collaborator

@tlrmchlsmth left a comment

Just starting to take a look -- Could you explain a bit about the change to a contiguous weight format?

Review threads on benchmarks/kernels/benchmark_sgmv_triton.py (resolved)
@FurtherAI
Contributor Author

FurtherAI commented Jun 11, 2024

@tlrmchlsmth So, S-LoRA puts the weights in a page pool shaped [page_size, hidden_dim], and each LoRA matrix occupies `rank` rows of it. By contiguous, I mean going back to the format that is currently implemented for the Punica kernels, where the LoRA weights are stored in a stack that is padded to max rank (e.g. [max_loras, 1, hidden_dim, max_rank] for LoRA B).

@Yard1 requested this for simplicity
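For readers unfamiliar with the two layouts, a minimal sketch of the contrast; the shapes and variable names are illustrative assumptions, not the exact code in either PR:

```python
import torch

hidden_dim, max_rank, max_loras = 4096, 64, 8
num_pages, page_size = 16, 32

# S-LoRA-style paged storage: all adapters share one page pool, and each LoRA's
# `rank` rows are scattered into it, located again through an indirection table.
paged_pool = torch.empty(num_pages * page_size, hidden_dim)

# Contiguous storage (the format this PR switches to, matching the existing
# Punica layout): one slot per LoRA, padded out to max_rank.
lora_b_stacked = torch.zeros(max_loras, 1, hidden_dim, max_rank)
# LoRA i's B matrix of rank r lives in lora_b_stacked[i, 0, :, :r].
```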

Collaborator

@Yard1 left a comment

This looks good - I think we can just merge it. Merging master should fix CI.

@FurtherAI
Contributor Author

@Yard1 You don't mean merge as in finish this PR already, right? We still have to remove Punica.

@Yard1
Collaborator

Yard1 commented Jun 11, 2024

@FurtherAI We can merge the kernels first, and then follow up with a PR to enable them. But either works!

@jeejeelee
Collaborator

@Yard1 I'm still waiting for your decision, but you haven't provided any feedback, so I don't quite understand.

@jeejeelee
Collaborator

jeejeelee commented Jun 11, 2024

@robertgshaw2-neuralmagic @tlrmchlsmth I would appreciate it if you could pay attention to my related PR #5036

@FurtherAI
Contributor Author

FurtherAI commented Jun 19, 2024

@Yard1 @tlrmchlsmth @robertgshaw2-neuralmagic
So, I am working on removing Punica and I think I've written all the code for it; I'm working on testing now. I'd appreciate any pointers if anyone has ideas about the errors I'm running into.

The first time I run a layer (currently testing the column-parallel layer in test_layers.py), it works. After that, it alternates between passing and failing in a particular order (the seed is set to the same value every time), with the result off by either about 0.2 or about 500 on average. This happens whether the layer is run in a loop or the function is duplicated multiple times (both repeat the same sequence). Pytest, Torch, and Triton (autotuner disabled) could all persist something across different function calls, but I am baffled by this behavior, since I think the calls should be independent. Has anyone seen this before, or does anyone have any ideas?

I pushed a commit with the current state.

Update

Fixed this; it was due to using the tensor tracking LoRA ranks incorrectly. Triton was most likely accessing out of bounds but not throwing an error for it.
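For context on that failure mode, a common guard is to mask every Triton load and store against the adapter's actual rank, so reads past the padded block return a defined value instead of silently going out of bounds. The kernel below is a hypothetical minimal sketch, not the PR's kernel:

```python
import triton
import triton.language as tl

@triton.jit
def copy_rank_slice(src_ptr, dst_ptr, rank, BLOCK_R: tl.constexpr):
    # Offsets cover the padded block; the mask keeps accesses inside the
    # adapter's true rank, and masked-off loads return 0.0 instead of garbage.
    offs = tl.arange(0, BLOCK_R)
    mask = offs < rank
    vals = tl.load(src_ptr + offs, mask=mask, other=0.0)
    tl.store(dst_ptr + offs, vals, mask=mask)
```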

@FurtherAI
Contributor Author

FurtherAI commented Jul 1, 2024

@Yard1 It looks like I need Triton >= 2.2.0. Where should the requirements be updated for this?
Also, the Intel XPU install specifically is failing. A naming issue or something; do you know what it is?

@chandan047

Hi, our team is really looking forward to this merge! Is there an estimate on when this will be merged?

@FurtherAI
Contributor Author

@chandan047 This was largely superseded by #5036, though these kernels can compute at the actual rank of each LoRA, which I'm not sure was added in the other PR. They are also simpler, but I'm not currently planning to merge this with the updated main. The other kernels were merged into main, so see if those work for your use case.

@simon-mo
Collaborator

Closed in favor of @jeejeelee's kernel. Thanks for the PR!

@simon-mo simon-mo closed this Oct 22, 2024
Development

Successfully merging this pull request may close these issues.

[Performance] 40% performance drop using lora vs no lora