[Kernel][RFC] Initial commit containing new Triton kernels for multi lora serving. #5356
Conversation
…computation. These (should) handle any shape and data type, apply to LoRAs in a paged format, compute at the actual LoRA rank, and also speed up grouped LoRA requests (especially prefill).
…nd data type and benefit from grouped LoRAs (i.e. grouped or prefill).
Just starting to take a look -- Could you explain a bit about the change to a contiguous weight format?
@tlrmchlsmth So, S-LoRA puts the weights in a page that is
@Yard1 requested this for simplicity
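The contiguous layout being discussed can be illustrated with a small sketch. This is a hypothetical NumPy illustration, not the PR's actual code; the names (`lora_a_stacked`, `lora_slice`) and shapes are invented for the example. The point is that stacking every LoRA's (padded) weights into one dense tensor lets a kernel address a given LoRA's weights with a single base offset, instead of following a page table as in S-LoRA's paged storage.

```python
import numpy as np

hidden, max_rank, num_loras = 16, 4, 3
rng = np.random.default_rng(0)

# Contiguous format: one dense tensor [num_loras, max_rank, hidden].
# Each LoRA occupies a fixed slot, padded up to max_rank, so the weights
# for LoRA i start at a single computable offset (i * max_rank * hidden).
lora_a_stacked = rng.standard_normal((num_loras, max_rank, hidden))

def lora_slice(weights, lora_id):
    # One base-offset lookup; no gather over scattered pages.
    return weights[lora_id]

print(lora_slice(lora_a_stacked, 1).shape)  # (4, 16)
```

The trade-off is padding every LoRA up to `max_rank`, which the paged format avoids; the conversation above suggests contiguity was chosen for simplicity.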
This looks good - I think we can just merge it. Merging master should fix CI.
@Yard1 You don't mean merge as in finish this PR already, right? We still gotta remove punica
@FurtherAI We can merge the kernels first, and then follow up with a PR to enable them. But either works!
@Yard1 I'm still waiting for your decision, but you haven't provided any feedback, so I don't quite understand.
@robertgshaw2-neuralmagic @tlrmchlsmth I would appreciate it if you could pay attention to my related PR #5036
@Yard1 @tlrmchlsmth @robertgshaw2-neuralmagic The first time I run a layer (currently testing column parallel in I pushed a commit with the current state. Update: Fixed this; it was due to using the tensor tracking LoRA ranks incorrectly. Triton was most likely accessing out of bounds but not throwing an error for it.
…emoved use of Punica kernels and allowed arbitrary lora extra vocab size.
@Yard1 Looks like I need Triton >= 2.2.0. Where all should the requirements be updated for this?
Hi, our team is really looking forward to this merge! Is there an estimate on when this will be merged? |
@chandan047 This was largely superseded by #5036, though these kernels have the ability to compute at the actual rank of each LoRA, which I'm not sure was added in the other PR. They are also simpler, but I am not currently planning to merge this with the updated main. The other kernels were merged into main, so see if those work for your use case.
Closed in favor of @jeejeelee's kernel. Thanks for the PR!
SGMV Triton Kernels
Should fix #2829, #4007, #4053, #4063, #3793, #4708
Modification of #5025 to change to contiguous weight format rather than S-LoRA paged weights.
In parallel with #5036
New Triton kernels for multi-LoRA computation. These (should) handle any shape and data type, compute at the actual LoRA rank, and also speed up grouped LoRA requests (especially prefill).
ping @Yard1
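The SGMV scheme this PR implements can be sketched in NumPy. This is a hedged illustration of the idea only, not the Triton kernels themselves; the names (`lora_a`, `lora_b`, `ranks`, `lora_ids`) and all shapes are assumptions made for the example. Tokens are grouped into segments by LoRA id, and each segment's shrink and expand matmuls run at that LoRA's actual rank rather than the padded maximum, which is what makes grouped (e.g. prefill) requests fast.

```python
import numpy as np

hidden_in, hidden_out, max_rank = 16, 8, 4
rng = np.random.default_rng(0)

# Stacked LoRA weights, padded to max_rank (shapes are illustrative).
lora_a = rng.standard_normal((2, max_rank, hidden_in))   # shrink weights
lora_b = rng.standard_normal((2, hidden_out, max_rank))  # expand weights
ranks = np.array([2, 4])  # actual rank of each LoRA (may be < max_rank)

x = rng.standard_normal((6, hidden_in))
lora_ids = np.array([0, 0, 0, 1, 1, 1])  # tokens grouped by LoRA (prefill-like)

y = np.zeros((6, hidden_out))
for lid in np.unique(lora_ids):
    seg = lora_ids == lid
    r = ranks[lid]
    # Use only the first r rows/cols of the padded weights, so the
    # amount of work scales with the LoRA's actual rank.
    tmp = x[seg] @ lora_a[lid, :r].T        # [seg_len, r]
    y[seg] += tmp @ lora_b[lid, :, :r].T    # [seg_len, hidden_out]
```

In the real kernels each segment would be handled by Triton program instances rather than a Python loop, but the segmentation and rank-truncated matmuls follow the same pattern.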