[Model][Jamba] Mamba cache single buffer #6739
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, comment /ready on the PR. 🚀
Force-pushed from dc9bf07 to d57ccb6
PR is ready, CI failures are not related to this PR.
Just starting to read through now. At a high level, the approach makes sense to me. Do you anticipate any cases where we'll end up shuffling a lot of data in and out of the first N slots?
And do you have any end-to-end performance numbers you can share?
You mentioned an added test for parallel sampling, but it's not present in this PR. Did you mean to remove it? I noticed that the added test was there previously.
Thank you for the review! Sorry for the long delay. Most shuffling occurs during the transition from prefill steps to decode steps. However, shuffling between sequential decode steps (which make up the majority of steps under a regular load) doesn't happen very often, since the cache is already in place (the previous implementation copied the Mamba cache from buffer to buffer in each and every decode step). Regarding end-to-end performance: we benchmark the prefill and decode forward passes independently. We've seen a 1-2 ms speed-up in decoding and no change in prefill steps. However, the main purpose of this PR is to reduce memory usage. The red line is the previous implementation, the blue line is this PR's implementation. RE the parallel sampling test: yes, I intended to add it, but the tiny Jamba model we use for unit tests behaves differently on different devices, so I've left it out for now until we have a trained tiny model for tests.
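For illustration only, here is a minimal sketch of the single-buffer bookkeeping discussed above, assuming a dict-based request-to-slot mapping. All names are made up and this is not the code in this PR: the states live in one persistent buffer, and before each forward pass the batch's states are swapped into the first n slots instead of being copied into a second buffer on every decode step.

```python
import torch

class SingleBufferMambaCache:
    """Illustrative sketch: one persistent buffer of `max_slots` Mamba states.

    Before each forward pass, the states of the requests in the current batch
    are swapped into slots 0..n-1 so the kernels can work on a contiguous view
    of the single buffer, instead of copying states into a second buffer.
    """

    def __init__(self, max_slots: int, state_shape: tuple[int, ...]):
        self.buffer = torch.zeros((max_slots, *state_shape))
        self.slot_of_request: dict[str, int] = {}   # request id -> slot index
        self.free_slots = list(range(max_slots))

    def _swap_slots(self, a: int, b: int) -> None:
        if a == b:
            return
        # Swap the two rows of the buffer (RHS fancy indexing makes a copy).
        self.buffer[[a, b]] = self.buffer[[b, a]]
        # Keep the request-to-slot mapping consistent with the swap.
        for req, slot in self.slot_of_request.items():
            if slot == a:
                self.slot_of_request[req] = b
            elif slot == b:
                self.slot_of_request[req] = a
        # Keep the free-slot bookkeeping consistent with the swap.
        for i, s in enumerate(self.free_slots):
            if s == a:
                self.free_slots[i] = b
            elif s == b:
                self.free_slots[i] = a

    def prepare_batch(self, request_ids: list[str]) -> torch.Tensor:
        """Move the batch's states into the first len(request_ids) slots."""
        for target, req in enumerate(request_ids):
            if req not in self.slot_of_request:
                self.slot_of_request[req] = self.free_slots.pop()
            self._swap_slots(self.slot_of_request[req], target)
        # Contiguous view handed to the forward pass.
        return self.buffer[: len(request_ids)]
```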
I left a few comments in-line. Generally, I think the approach makes sense and don't see any specific problems, but I think we should get somebody working on multi-step scheduling to review in case any conflicts might arise there. @alexm-neuralmagic could you look into that and suggest other reviewers as well?
I think the functions that manage the mamba cache might be better organized if they were factored out and encapsulated in their own class. I was thinking we could try to make it behave similarly to the BlockManager in terms of interface. Goal would be to incrementally make the mamba cache fit into vLLM's native systems. Doesn't have to be in this PR but curious to hear your thoughts on this.
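As a rough illustration of that suggestion, an interface-only sketch of what a factored-out manager could look like. All names here are hypothetical suggestions loosely modeled on an allocate/free-style interface, not existing vLLM classes or methods:

```python
class MambaCacheManager:
    """Interface sketch only; method names are suggestions, not vLLM code."""

    def __init__(self, max_slots: int) -> None:
        ...

    def allocate(self, request_id: str) -> int:
        """Reserve a free cache slot for a new request; raise if none is free."""

    def free(self, request_id: str) -> None:
        """Release the request's slot when it finishes or is preempted."""

    def prepare_current_run(self, request_ids: list[str]) -> None:
        """Swap the batch's states into slots 0..len(request_ids)-1."""

    def copy(self, parent_id: str, child_id: str) -> None:
        """Duplicate a parent's state for parallel sampling (n > 1)."""
```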
One last question: A lot of this would be simpler if the two mamba cache update functions took a list of indices rather than requiring contiguous tensors. Have you looked into this at all? To me it looks like it wouldn't be too technically difficult to do, but would require a pair of PRs on https://github.com/Dao-AILab/causal-conv1d and https://github.com/state-spaces/mamba. Might be worth it just to avoid the state management.
Thank you for the review! @alexm-neuralmagic would love to hear your opinion.
@mzusman @tlrmchlsmth Did a quick pass over the PR and I see that the changes are inside the forward() function of the model itself. The multi-step logic is "above" this function, so I don't think it should interfere with the changes here. Btw, nice optimization!
@mzusman FYI I am working on modifying the kernels to take a tensor of indices for the batch coordinates. I think this branch gives us the interface we'd need to avoid all of the state copying for causal_conv1d_update: Going to try to do the same thing to
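For context, a sketch of the gather/scatter semantics such an index-based kernel interface would provide. The variable names and the placeholder update are assumptions, and the actual fused kernel signature is not shown here; the point is only that each sequence's state could stay in its original slot:

```python
import torch

# Instead of requiring the batch's states to occupy slots 0..n-1, an
# index-based update kernel would receive a tensor of slot indices and
# gather/scatter the states itself, roughly like this:
max_slots, dim, width = 64, 128, 4
conv_state = torch.zeros(max_slots, dim, width)
slots_of_batch = torch.tensor([7, 42, 3])        # each request keeps its own slot

batch_states = conv_state[slots_of_batch]         # gather the batch's states
batch_states += 1.0                               # stand-in for the conv update
conv_state[slots_of_batch] = batch_states         # scatter the updated states back
```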
That's really great! Cache management will be easier to handle. That's right, landing this PR is quite urgent for us at the moment, and it does not block future improvements. I think it would be better to split those improvements into follow-up PRs.
FYI I just restarted the failed jobs
Does it make sense to add unit tests for the utils that maintain the cache? Seems like they're complicated enough to want additional testing. Beyond that, LGTM if green
I think it makes sense to add unit tests for the cache management utils. RE CI: I'll rebase, maybe that will help; the failures don't seem to be related to this PR.
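In the spirit of that suggestion, a hypothetical unit test for the slot-management behavior. It exercises the illustrative SingleBufferMambaCache sketch from earlier in this thread, not vLLM's actual helpers:

```python
import torch

def test_prepare_batch_moves_states_to_front():
    cache = SingleBufferMambaCache(max_slots=8, state_shape=(4,))
    # Give two requests distinguishable states via the contiguous view.
    view = cache.prepare_batch(["a", "b"])
    view[0].fill_(1.0)   # request "a"
    view[1].fill_(2.0)   # request "b"
    # A new batch ordering should still surface each request's own state
    # in the first n slots.
    view = cache.prepare_batch(["b", "a"])
    assert torch.all(view[0] == 2.0)
    assert torch.all(view[1] == 1.0)
```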
* WIP - working on swapping indices
* WIP
* Save changes
* Organize indices during assignment, working and passing tests!
* Add TODOs
* Remove diff
* Format
* Remove TODOs
* Remove unused code
* Cleanup
* Cleanup
* Cleanup the redundant 10 blocks
* Small changes
* Simplify code and add comments
* Renaming and simplify
* Remove return
* Clean up
* Cleanup
* Renaming
* Another clean up
* Clean up
* Clean up and simplify more
* Add n > 1 test
* Format
* Cosmetics
* Add functionality to find first free
* Raise exception if could not find spot
* Typos
* Add 2 slots as precaution

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
This reverts commit 381c2aa.
This reverts commit f1e792d.
This reverts commit bda9876.
Force-pushed from 2f5293b to 3eeeeb7
Going to merge this one, and then try to simplify with updated kernels :) Thanks!
Co-authored-by: Mor Zusman <morz@ai21.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
By carefully allocating each sequence's Mamba cache in the first "n" slots of the cache buffer before the forward pass, we can remove the redundant CG (CUDA graph) Mamba buffer.
This PR saves memory, simplifies the Jamba inner-state management code, and reduces latency (by removing redundant data copies).
This PR is also applicable to #6484, @tlrmchlsmth.
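A minimal sketch of why a single buffer is enough for CUDA-graph runs, assuming the slot management illustrated earlier in the thread. The sizes and the decode step below are stand-ins, not the Jamba model code, and a CUDA device is required:

```python
import torch

# The CUDA graph is captured once against the fixed-address view
# buffer[:max_batch]; at runtime we only need the active requests' states to
# occupy slots 0..n-1 before replaying the graph, so no second "CG" copy of
# the cache is needed.
max_slots, max_batch, state_dim = 64, 8, 16
mamba_cache = torch.zeros(max_slots, state_dim, device="cuda")
graph_view = mamba_cache[:max_batch]        # fixed pointer seen by the graph

def decode_step(states: torch.Tensor) -> None:
    # Stand-in for the Mamba decode forward pass; updates states in place.
    states.mul_(0.9).add_(1.0)

g = torch.cuda.CUDAGraph()
# (A warm-up run on a side stream is recommended before capture; omitted here.)
with torch.cuda.graph(g):
    decode_step(graph_view)

# Each decode step: move the active requests' states into the first n slots
# (see the slot-swapping sketch above), then replay the captured graph.
g.replay()
```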