
[Kernel][RFC] Initial commit containing new Triton kernels for multi lora serving. #5356

Closed
Wants to merge 17 commits.

Conversation

FurtherAI
Contributor

SGMV Triton Kernels

Should Fix #2829, #4007, #4053, #4063, #3793, #4708

Modification of #5025, changing to a contiguous weight format rather than S-LoRA paged weights.
In parallel with #5036


New Triton kernels for multi-LoRA computation. These (should) handle any shape and data type, compute at the actual LoRA rank, and also speed up grouped LoRA requests (especially prefill); see the sketch after the list below.

  • Replace Punica kernels.
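For illustration, here is a minimal PyTorch reference of the computation these SGMV (segmented gather matrix-vector) kernels implement. The tensor names, shapes, and the pure-Python loop are assumptions for clarity, not the kernels' actual code:

```python
import torch

def sgmv_reference(x, lora_a, lora_b, seg_starts, seg_ends, lora_ids, ranks):
    """Pure-PyTorch reference for segmented LoRA (SGMV-style) computation.

    x:          [num_tokens, hidden_dim] activations, tokens grouped by LoRA
    lora_a:     [max_loras, hidden_dim, max_rank] stacked A weights, padded to max_rank
    lora_b:     [max_loras, max_rank, out_dim]    stacked B weights, padded to max_rank
    seg_starts, seg_ends: token boundaries of each segment (one LoRA per segment)
    lora_ids:   LoRA index used by each segment
    ranks:      actual rank of each LoRA, so the padding is never touched
    """
    out = torch.zeros(x.shape[0], lora_b.shape[-1], dtype=x.dtype, device=x.device)
    for start, end, idx in zip(seg_starts, seg_ends, lora_ids):
        r = ranks[idx]
        # Compute at the LoRA's true rank rather than the padded max_rank.
        a = lora_a[idx, :, :r]   # [hidden_dim, r]
        b = lora_b[idx, :r, :]   # [r, out_dim]
        out[start:end] = (x[start:end] @ a) @ b
    return out
```

Grouping all tokens that share a LoRA into one segment is what helps prefill (many tokens, few adapters): each adapter's weights are loaded once per segment rather than once per token.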

ping @Yard1

FurtherAI added 5 commits May 24, 2024 03:33
@robertgshaw2-redhat
Collaborator

@tlrmchlsmth

Collaborator

@tlrmchlsmth left a comment

Just starting to take a look -- Could you explain a bit about the change to a contiguous weight format?

Review threads on benchmarks/kernels/benchmark_sgmv_triton.py (resolved)
@FurtherAI
Contributor Author

FurtherAI commented Jun 11, 2024

@tlrmchlsmth So, S-LoRA puts the weights in a page pool shaped [page_size, hidden_dim], and each LoRA matrix occupies `rank` rows of it. By contiguous, I mean going back to the format that is currently implemented for the Punica kernels, where the LoRA weights are stored in a stack that is padded to max rank (e.g. [max_loras, 1, hidden_dim, max_rank] for LoRA B).

@Yard1 requested this for simplicity
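For readers unfamiliar with the two layouts, a minimal sketch of the contrast; the shapes and variable names are illustrative assumptions, not the exact code in either PR:

```python
import torch

hidden_dim, max_rank, max_loras = 4096, 64, 8
num_pages, page_size = 16, 32

# S-LoRA-style paged storage: all adapters share one page pool, and each LoRA's
# `rank` rows are scattered into it, located again through an indirection table.
paged_pool = torch.empty(num_pages * page_size, hidden_dim)

# Contiguous storage (the format this PR switches to, matching the existing
# Punica layout): one slot per LoRA, padded out to max_rank.
lora_b_stacked = torch.zeros(max_loras, 1, hidden_dim, max_rank)
# LoRA i's B matrix of rank r lives in lora_b_stacked[i, 0, :, :r].
```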

Collaborator

@Yard1 left a comment

This looks good - I think we can just merge it. Merging master should fix CI.

@FurtherAI
Contributor Author

@Yard1 You don't mean merge as in finish this PR already, right? We still have to remove Punica.

@Yard1
Collaborator

Yard1 commented Jun 11, 2024

@FurtherAI We can merge the kernels first, and then follow up with a PR to enable them. But either works!

@jeejeelee
Collaborator

@Yard1 I'm still waiting for your decision, but you haven't provided any feedback, so I don't quite understand.

@jeejeelee
Collaborator

jeejeelee commented Jun 11, 2024

@robertgshaw2-neuralmagic @tlrmchlsmth I would appreciate it if you could pay attention to my related PR #5036

@FurtherAI
Contributor Author

FurtherAI commented Jun 19, 2024

@Yard1 @tlrmchlsmth @robertgshaw2-neuralmagic
So, I am working on removing Punica and I think I've written all the code for it; I'm working on testing now. I'd appreciate any pointers if anyone has ideas about the errors I'm running into.

The first time I run a layer (currently testing the column-parallel layer in test_layers.py), it works. After that, it alternates between passing and failing in a particular order (the seed is set to the same value every time), with the result off by either about 0.2 or about 500 on average. This happens whether the layer is run in a loop or the function is duplicated multiple times (both repeat the same sequence). Pytest, Torch, and Triton (autotuner disabled) could all persist something across different function calls, but I am baffled by this behavior, since I think the calls should be independent. Has anyone seen this before, or does anyone have any ideas?

I pushed a commit with the current state.

Update

Fixed this; it was due to using the tensor tracking LoRA ranks incorrectly. Triton was most likely accessing out of bounds but not throwing an error for it.
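For context on that failure mode, a common guard is to mask every Triton load and store against the adapter's actual rank, so reads past the padded block return a defined value instead of silently going out of bounds. The kernel below is a hypothetical minimal sketch, not the PR's kernel:

```python
import triton
import triton.language as tl

@triton.jit
def copy_rank_slice(src_ptr, dst_ptr, rank, BLOCK_R: tl.constexpr):
    # Offsets cover the padded block; the mask keeps accesses inside the
    # adapter's true rank, and masked-off loads return 0.0 instead of garbage.
    offs = tl.arange(0, BLOCK_R)
    mask = offs < rank
    vals = tl.load(src_ptr + offs, mask=mask, other=0.0)
    tl.store(dst_ptr + offs, vals, mask=mask)
```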

@FurtherAI
Contributor Author

FurtherAI commented Jul 1, 2024

@Yard1 It looks like I need Triton >= 2.2.0. Where should the requirements be updated for this?
Also, the Intel XPU install specifically is failing. A naming issue or something; do you know what it is?

@chandan047

Hi, our team is really looking forward to this merge! Is there an estimate on when this will be merged?

@FurtherAI
Contributor Author

@chandan047 This was largely superseded by #5036, though these kernels can compute at the actual rank of each LoRA, which I'm not sure was added in the other PR. They are also simpler, but I'm not currently planning to merge this with the updated main. The other kernels were merged into main, so see if those work for your use case.

@simon-mo
Collaborator

Closed in favor of @jeejeelee's kernel. Thanks for the PR!

@simon-mo simon-mo closed this Oct 22, 2024
Development

Successfully merging this pull request may close these issues.

[Performance] 40% performance drop using lora vs no lora