llama : support Mamba Selective State Space Models #5328

compilade · 2024-02-05T01:09:14Z

Note

Some changes made between when this PR was opened and when it was merged require re-converting previously converted GGUF Mamba models.

2024-02-28: using Mamba-specific GGUF key-values instead of (mis)using attention ones 709ea7d
2024-03-07: rename metadata to be more similar to transformers library 17e4d6c

This should fix #4353

Implementing Mamba in llama.cpp took more time than I thought. But it's here! See the TODO section below for a glimpse of some of the challenges. CPU-only for now.

I started working on this as an experiment and because I wanted to try Mamba models with llama.cpp (also, there have been quite a few finetunes already).
Turns out that implementing support for a novel model architecture is quite fun (well, at least when it finally works).
The most powerful machine on which I try LLMs is a low-power laptop with 8GB of RAM and an Intel CPU (no discrete GPU), so I can't try Mamba-3B in its full f32 glory (the full weights take 11GB), but at least now it's possible to use it quantized.

Constant memory usage is a big advantage of Mamba models, but it also means that previous states are not all kept in memory (in the current implementation, only the last one is kept). As a result, the server example might re-process more of the prompt than necessary, especially if your client trims the end of the output (it's also problematic that the stop token(s) are not included in the server's responses). The main example has no such problem.

Currently, the initial text generation speed for Mamba (with empty context) is a bit slower than for Transformer-based models, but unlike them, Mamba's speed does not degrade with the number of tokens processed.
Also note that quantization may make the state unstable (making the output gibberish), but more testing is needed to figure out how often this happens: so far I have only seen it with very small models (130M), and not yet with bigger ones (3B).

For testing, I recommend converting from https://huggingface.co/state-spaces/mamba-130m-hf since it's small, the config.json doesn't require modification, the tokenizer is already next to the model files, and the token_embd weight is shared with the output weight, so the download is smaller.

(EDIT: the following paragraph was written before the re-release of the Mamba models with more metadata (see the mamba-hf collection). Converting these should be more straightforward.)
The official models require modifying their config.json to add the line "architectures": ["MambaForCausalLM"], or "architectures": ["MambaLMHeadModel"], (either should work). The vocab will automatically come from llama.cpp/models/ggml-vocab-gpt-neox.gguf as there are no tokenizer files in the official Mamba model directories (at least, for the non-mamba-hf repositories).
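The config.json edit described above can be scripted. This is a minimal sketch, not part of the convert script: `add_architectures` is a hypothetical helper, and it assumes config.json is plain JSON sitting directly in the model directory.

```python
import json
from pathlib import Path


def add_architectures(model_dir, arch="MambaForCausalLM"):
    """Add an "architectures" field to config.json if it is missing.

    Hypothetical helper. Returns True if the file was modified and
    False if the field was already present.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if "architectures" in config:
        return False  # leave an existing value untouched
    config["architectures"] = [arch]
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

Calling `add_architectures("../path/to/mamba-130m")` before running the convert script should have the same effect as the manual edit; passing `arch="MambaLMHeadModel"` gives the other accepted name.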

Design decisions

I'd like to discuss a few things before this can be merged:

Tensor names
- I added new tensor types for Mamba, because the weights don't have exactly the same roles as in Transformers, but some of them might be redundant. For example, instead of ssm_out, I could probably have re-used attn_output, but then its relationship with ssm_in would have been less obvious. Conversely, it's not really attention, but I still re-used the attn_norm type for the layer norms.
Comments
- I might have put too many comments. Some of them are explanations of what's going on, others are to help me refactor more easily (like the comments describing the changes in tensor dimensions).
Metadata and convert script
- (ab)use of the KV cache metadata
  - Currently, the metadata for HEAD_COUNT, KEY_LENGTH and VALUE_LENGTH are used purely for making the KV cache the right size, depending on d_conv and d_state sizes (usually 4 and 16, respectively). This is probably wrong, since changing anything about the cache size (like I did in 7016fe5) breaks existing converted-to-GGUF Mamba models. (Fixed by 709ea7d.)
- What context length should be stored in the model? The one they trained with is not even in config.json next to their model weights, and the effective context length is bigger than that anyway. Should I put a huge number like $2^{20}$ (1048576), or should I put the sequence length with which they say they trained their models in the paper (2048)?
- Speaking of config.json, the official Mamba models don't have an architectures field, which makes the model type hard to detect. For now, I've resorted to expecting "architectures": ["MambaForCausalLM"], in there, since the Q-bert/Mamba-* models are the only ones I've found which have an actual architecture defined in the config.json. Another architecture name which I've come across is MambaLMHeadModel, but it has not been used in the config.json of any Mamba models I've looked at (I might have missed some). It seems to be the class name of the official Mamba implementation, and I first saw it in the description of their 3B model trained on SlimPajama.
- The official Mamba models use the GPT-NeoX tokenizer, but don't include the tokenizer at all along with their weights. There is a way to detect the absence of tokenizer.json and use llama.cpp's models/ggml-vocab-gpt-neox.gguf, and I did exactly that.
Quantization
- Currently, the _S, _M and _L variants of k-quants are the same, because I don't yet know which weights are more (or less) important. This will require experimentation.
- A thing I've noticed is that the more a Mamba model is quantized, the more likely the model will output gibberish (especially with smaller models like mamba-130m at Q4_K). It would be nice to find a quant mix which alleviates this.
Performance
- I don't have a GPU, so all of my numbers are for CPU inference.
  - I did not implement GPU kernels for the new operators I added (I could not test them anyway).
- In the paper, they compare the performance with Transformers for 128 tokens after 2048 tokens have been put in the context. So my speed comparisons with empty-context-Transformers are not directly comparable with theirs (and they use GPUs).
- Most of the CPU time is spent on matrix multiplications from the linear projections (there are lots of layers in Mamba models (64 in Mamba-3B))
- The parallel scan step is more expensive than the conv step.
- I fused operations together in ggml_ssm_scan (and got a 25% perf boost on Mamba-3B compared to not fusing the operations), so I also removed my addition of the ggml_exp and ggml_soft_plus operators, since they are now unused.
- I also fused the operations of the conv step in ggml_ssm_conv because managing the states of simultaneous sequences was easier that way.
- Memory usage depends not on context length but on batch size and on the number of parallel sequences. If memory is precious, use a smaller batch size and/or fewer parallel sequences.
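The tokenizer fallback mentioned under "Metadata and convert script" boils down to a file-existence check. A minimal sketch, assuming the check is made on tokenizer.json alone; `pick_vocab_source` and its return format are hypothetical and are not the names used in the convert script.

```python
from pathlib import Path

# Path of the fallback vocab inside a llama.cpp checkout, as named in this PR.
FALLBACK_VOCAB = Path("models") / "ggml-vocab-gpt-neox.gguf"


def pick_vocab_source(model_dir, llama_cpp_dir="."):
    """Decide where the vocab should come from (hypothetical helper).

    Returns ("tokenizer", path) when tokenizer.json is next to the weights,
    otherwise ("fallback", path) pointing at the GPT-NeoX vocab GGUF.
    """
    tokenizer = Path(model_dir) / "tokenizer.json"
    if tokenizer.is_file():
        return "tokenizer", tokenizer
    return "fallback", Path(llama_cpp_dir) / FALLBACK_VOCAB
```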

TODO

Things that should (probably) be done before merging, from more important to less important:

Better support the KV cache API: should work correctly when the functions are used on whole sequences.
- Replace usages of llama_kv_cache_seq_rm on parts of sequences with an equivalent approach that operates on whole sequences (required for at least the server and parallel examples)
  - For now, the speculative and lookahead examples remain unsupported with Mamba models. This will probably be in a separate PR.
- Limit how many seq_id the perplexity example uses
Simultaneous sequence processing (required for the parallel example, the HellaSwag benchmark in the perplexity example, and probably also the server example)
Find a better way to handle the KV cache size than by misusing metadata
- This is necessary to avoid breaking Mamba models when MambaFormer gets worked on
- But this requires defining new Mamba-specific key-value pairs in the GGUF metadata
  - I'm not sure how similar Mamba's metadata is to that of other SSMs (specifically regarding d_conv, d_inner, d_state, and dt_rank)
State saving and restoring
Detect lack of tokenizer.json and use models/ggml-vocab-gpt-neox.gguf when converting a Mamba model to GGUF
- The resulting models are exactly equivalent to what they would be if tokenizer.json and tokenizer_config.json from GPT-NeoX were in the model directory.
Remove redundant comments

Out of scope for this PR

GPU kernels for ggml_ssm_conv and ggml_ssm_scan
- CPU-only for now
Support for mixed recurrent Transformer models (like the MambaFormer architecture)
- For now, it's acceptable to reuse the K and V tensors of the KV cache for Mamba's states, but when MambaFormer gets worked on, it's going to be necessary to use other tensors for the states.
Mamba support for speculative and lookahead examples
- These use sequences in a complicated way. Will need further thought for Mamba. Also, at the time of writing, they don't support a bigger input prompt than the batch size, so I think making them work with Mamba is better done in a separate PR after this is fixed.
Better quantization mixes
- Knowing which linear projection tensors are more important will need experimentation
- For now, Mamba-specific weights which are not big linear projections are kept as f32. So Q4_K_M takes 5.76 bits per weight with Mamba 2.8B
State backtracking
- Would be very useful to reduce prompt reprocessing in the server example
- Needs further research

References

The Mamba paper
- https://arxiv.org/abs/2312.00752
Huge inspiration for initial implementation
- https://github.com/kroggen/mamba.c
Another minimal Mamba implementation
- https://github.com/johnma2006/mamba-minimal/blob/master/model.py
The official Mamba implementation
- https://github.com/state-spaces/mamba
Official Mamba model weights
- https://huggingface.co/state-spaces
- If you want to convert them, edit their config.json to add "architectures": ["MambaForCausalLM"], then use python3 convert-hf-to-gguf.py ../path/to/mamba-130m/ with the options you want (see --help) and the correct path.
Official Mamba model weights, but easier to convert, and with shared token_embd.weight and output.weight, so the download is a bit smaller than from the other repositories
- https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
Mamba models, but (re-exported? (the layer names are slightly different)) with a more complete config.json and the presence of tokenizer.json.
- https://huggingface.co/collections/Q-bert/mamba-65869481595e25821853d20d

FSSRepo · 2024-02-05T03:28:30Z

@compilade just out of curiosity, is any convolution operation performed? I see some tensors with the name conv, but I never see ggml_conv_1d or ggml_conv_2d being used at any point.

compilade · 2024-02-05T04:06:27Z

@compilade just out of curiosity, is any convolution operation performed? I see some tensors with the name conv, but I never see ggml_conv_1d or ggml_conv_2d being used at any point.

@FSSRepo
Well, at some point I tried to use ggml_conv_1d for this, but Mamba uses a number of groups equal to the number of in_channels, and ggml_conv_1d does not support setting the number of groups (at least, from what I understood when trying to make it work).

But it turns out that the desired operation in this case is exactly equivalent to making a self-overlapping view which shifts by one column at each stride in the 3rd dimension (which corresponds here to the number of tokens in the batch), and then doing a matrix multiplication with the conv1d weight of Mamba over the d_conv dimension (the kernel size of the convolution, which is 4). That matrix multiplication is done with ggml_mul and ggml_sum_rows because that way each row of the x tensor is still contiguous after permuting away the 1-sized dimension.
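To make the equivalence concrete, here is a minimal pure-Python sketch. It is not the ggml code: list slices stand in for the strided self-overlapping view, and the function names are hypothetical. It checks that multiplying overlapping windows by the kernel and summing each window gives the same result as shifting the conv state one token at a time.

```python
def conv_batched(conv_state, x, weight):
    """Depthwise causal conv over a whole batch of tokens.

    conv_state: per-channel list of the last (d_conv - 1) inputs
    x:          per-channel list of n_tokens new inputs
    weight:     per-channel list of d_conv kernel weights
    """
    out = []
    for state_c, x_c, w_c in zip(conv_state, x, weight):
        d_conv = len(w_c)
        padded = state_c + x_c  # previous inputs followed by the new ones
        # Overlapping windows: window t starts one element after window t - 1.
        windows = [padded[t:t + d_conv] for t in range(len(x_c))]
        # Multiply by the kernel, then sum each window
        # (the role played by ggml_mul and ggml_sum_rows in the text).
        out.append([sum(w * v for w, v in zip(w_c, win)) for win in windows])
    return out


def conv_one_token_at_a_time(conv_state, x, weight):
    """Reference: recurrent mode, shifting the state by one column per token."""
    out = [[] for _ in x]
    state = [list(s) for s in conv_state]
    for t in range(len(x[0])):
        for c, w_c in enumerate(weight):
            window = state[c] + [x[c][t]]
            out[c].append(sum(w * v for w, v in zip(w_c, window)))
            state[c] = window[1:]  # drop the oldest column
    return out


conv_state = [[0, 0, 0], [1, 2, 3]]           # 2 channels, d_conv - 1 = 3
x = [[1, 2, 3, 4, 5], [1, 0, -1, 2, 0]]       # 5 new tokens per channel
weight = [[1, 2, 3, 4], [0, 1, 0, -1]]        # d_conv = 4
print(conv_batched(conv_state, x, weight) == conv_one_token_at_a_time(conv_state, x, weight))
```

Each channel only ever sees its own kernel row, which is what "number of groups equal to the number of in_channels" means here.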

Not sure if I'm explaining this clearly, because I did not really know anything about convolutions before working on this.

(Here are the relevant lines for the "conv" step in my implementation.)

I figured this out while thinking about how to process multiple tokens at a time in the "conv" step, starting from how the next conv_state is built one token at a time in mamba.c and from the corresponding lines in the official "simple" implementation. What I ended up with is much simpler than what I initially thought would have been necessary for batch processing.

ggerganov

Cool!

Turns out that implementing support for a novel model architecture is quite fun (well, at least when it finally works).

Glad to hear 😄

Regarding the KV questions:
IIUC one slot is needed per sequence, so in that sense the KV cache size could be interpreted as the maximum number of distinct sequences that can be processed simultaneously.

GPU kernels likely for future PRs

compilade · 2024-02-05T21:20:01Z

Regarding the KV questions:
IIUC one slot is needed per sequence, so in that sense the KV cache size could be interpreted as the maximum number of distinct sequences that can be processed simultaneously.

(What follows are some thoughts about where the number of distinct sequences should be taken from. TL;DR at the end.)

For Mamba 3B, each KV slot takes 23.75 MiB. If the value is taken from n_ctx, since the default value is 512, the KV cache would take 512 times 23.75 MiB, which is 11.875 GiB, an unacceptably large amount of used memory (especially since most people won't use anywhere near 512 distinct sequences at the same time).
Also, even if somehow a very small n_ctx is used with Mamba, almost everything currently seems to expect the input prompt(s) to never be bigger than the context size, which complicates this solution. But at least, since quite a lot of memory calculations are based on n_ctx, less initialization code has to be changed.
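The figures above can be reproduced with a short calculation. This is a sketch under stated assumptions: states stored as f32, a conv_state of (d_conv - 1) columns, 64 layers, d_conv = 4 and d_state = 16 as mentioned elsewhere in this PR, and d_inner = 5120 for Mamba 3B, which is not stated in this thread. The function name is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def mamba_state_bytes_per_seq(n_layer, d_inner, d_state, d_conv, bytes_per_elem=4):
    """Bytes needed to hold the recurrent states of one sequence."""
    # conv_state keeps the last (d_conv - 1) inputs of each inner channel.
    conv_state = (d_conv - 1) * d_inner
    # ssm_state keeps d_state values for each inner channel.
    ssm_state = d_state * d_inner
    return n_layer * (conv_state + ssm_state) * bytes_per_elem


# Assumed Mamba 3B shapes; d_inner=5120 is an assumption, not from this thread.
per_seq = mamba_state_bytes_per_seq(n_layer=64, d_inner=5120, d_state=16, d_conv=4)
print(per_seq / MIB)        # MiB for a single sequence
print(512 * per_seq / GIB)  # GiB if one slot were allocated per n_ctx = 512
```

Under these assumptions the two printed values are 23.75 and 11.875, matching the numbers quoted above.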

So, let's say I instead take the max number of distinct sequences from the value of n_parallel, then that value would need to be available at KV cache initialization (easy, this means adding it to llama_context_params to access it from llama_new_context_with_model and then in llama_kv_cache_init).
The current default value of 1 here is reasonable, but for servers, n_parallel should be at least as big as the number of users, or it won't work properly (is it already like this for Transformer-based models?).

But then I have to change how some things are initialized, replacing cparams.n_ctx with kv_self.size in a bunch of places where the assumption was that they are the same (but they aren't with Mamba).
I think that's the better way, since it would also make it easier to use the KV cache differently than even what Mamba and Transformers do. If there's ever a non-linear and non-constant way to fill the KV cache, it should be easier to implement after this change.

Another thing regarding n_parallel and the KV cache size: even for Transformers, it could be useful to make the KV cache size a multiple of n_ctx, which would make the n_ctx per client slot in the server example easier to reason about (each would simply be equal to the global n_ctx). Though it would make it harder at a glance to see the total context size.
I'm not sure what users expect when setting the --ctx-size in conjunction with --parallel regarding memory usage (currently, each client slot gets a fraction of the specified --ctx-size).
I presume it was done this way because the KV cache size had always been set from n_ctx.
In any case, it's probably best to avoid making user-facing breaking changes like this for Transformer-based models in this PR, though, so I'll leave this idea unimplemented for now.

TL;DR:

I'll try to make Mamba's KV cache size proportional to n_parallel as it seems to be the appropriate parameter to get the max number of distinct sequences processed at once.

compilade · 2024-02-09T00:17:18Z

I've been thinking about what parts of the KV cache API can and cannot be supported for Mamba.

In general, functions which operate on whole sequences or the whole KV cache can be relatively easily supported.

But a lot of KV cache API functions take a range of token positions, and this cannot easily work with Mamba (too many states would need to be kept unnecessarily).

| Function | Can be supported | Acceptable |
| --- | --- | --- |
| `llama_kv_cache_clear` | Yes | Yes |
| `llama_kv_cache_seq_rm` | Partially | No |
| `llama_kv_cache_seq_cp` | Partially | Yes |
| `llama_kv_cache_seq_keep` | Yes | Yes |
| `llama_kv_cache_seq_shift` | No | Yes |
| `llama_kv_cache_seq_div` | No | Yes |

Here, "Partially" means "Only on entire sequences" (all tokens of a sequence, regardless of their position).

The most problematic function is llama_kv_cache_seq_rm, which is used in the server example to clear tokens after the system prompt.
This could be worked around by dedicating a seq_id to the system prompt, and then using llama_kv_cache_seq_cp to copy over the system prompt to the other sequences when it's needed. The seq_ids for the client slots would need to be offset, though.
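A toy model of that workaround, as a minimal sketch: `ToyRecurrentCache` is a hypothetical stand-in and not the llama.cpp KV cache. It keeps one state per seq_id, reserves seq_id 0 for the system prompt, and offsets the client slots by one.

```python
import copy


class ToyRecurrentCache:
    """One opaque state per seq_id, standing in for Mamba's conv/ssm states."""

    def __init__(self, n_seq_max):
        self.states = [None] * n_seq_max

    def process(self, seq_id, tokens):
        # Stand-in for a forward pass: the "state" is just the token history.
        # A real recurrent state has a fixed size; a list is used here only
        # so the effect of each operation is easy to inspect.
        state = self.states[seq_id] or []
        self.states[seq_id] = state + list(tokens)

    def seq_cp(self, src, dst):
        # Whole-sequence copy: always possible, it just duplicates a state.
        self.states[dst] = copy.deepcopy(self.states[src])


SYSTEM_SEQ = 0  # seq_id dedicated to the system prompt


def slot_seq_id(slot_id):
    # Client slots are offset by one because seq_id 0 holds the system prompt.
    return slot_id + 1


cache = ToyRecurrentCache(n_seq_max=3)      # 2 client slots + 1 system prompt
cache.process(SYSTEM_SEQ, ["<system>"])     # processed once
for slot in (0, 1):
    cache.seq_cp(SYSTEM_SEQ, slot_seq_id(slot))  # reset a slot to "system prompt only"
cache.process(slot_seq_id(0), ["hello"])
print(cache.states)
```

Resetting a slot then never needs a position range: copying the system-prompt state over the slot's state replaces it entirely.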

I think that most of what is currently done with position ranges (when using llama_kv_cache_seq_cp and llama_kv_cache_seq_rm) could be done with better sequence management.

This is kind of a blocker for Mamba support in llama.cpp, but it can wait. I need to finish trying to make multiple independent sequences work with Mamba before this can be useful to fix.

ggerganov · 2024-02-09T12:27:20Z

This could be worked around by dedicating a seq_id to the system prompt, and then using llama_kv_cache_seq_cp to copy over the system prompt to the other sequences when it's needed. The seq_ids for the client slots would need to be offset, though.

Yes, that sounds like the right way to do it

I think that most of what is currently done with position ranges (when using llama_kv_cache_seq_cp and llama_kv_cache_seq_rm) could be done with better sequence management.

More thoughts on this are welcome

compilade · 2024-02-22T03:53:44Z

Now that multiple sequences can be processed at once, I've been trying to make the server example work with Mamba.

I think that most of what is currently done with position ranges (when using llama_kv_cache_seq_cp and llama_kv_cache_seq_rm) could be done with better sequence management.

More thoughts on this are welcome

I think I was wrong. Some uses of position ranges do seem necessary. The server example currently uses llama_kv_cache_seq_rm with position ranges to keep the common part between the cached tokens and the requested prompt. The default web example seems to trim one or two tokens at the end of each response, which means the "correct" way to do this with Mamba is to reprocess the whole prompt when it's not completely the same as the cached tokens (which is not ideal and could give a bad first impression of Mamba replying slower over time because of the re-processing time).

I'm wondering if it's okay to make llama_kv_cache_seq_rm return a bool (instead of void, changing the public API for that function) to let the caller know whether the removal succeeded (it would only be fallible for Mamba for now). This way, the server can still try to trim the prompt when it can, and fall back to re-processing everything after the system prompt when it can't.
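The proposed contract can be sketched with a toy cache. Everything here is hypothetical (class, method signatures, and the `[p0, p1)` range convention with a negative `p1` meaning "to the end"); it only illustrates a fallible removal with a caller-side fallback, and for simplicity the fallback reprocesses the whole prompt rather than everything after a system prompt.

```python
class ToyRecurrentCache:
    """Tracks only how many tokens were folded into each sequence's state."""

    def __init__(self):
        self.n_tokens = {}  # seq_id -> number of processed tokens

    def process(self, seq_id, n_new):
        self.n_tokens[seq_id] = self.n_tokens.get(seq_id, 0) + n_new

    def seq_rm(self, seq_id, p0, p1):
        """Remove positions [p0, p1) of a sequence; p1 < 0 means "to the end".

        Returns False when the range covers only part of the sequence,
        because a single recurrent state cannot be partially reset.
        """
        n = self.n_tokens.get(seq_id, 0)
        end = n if p1 < 0 else min(p1, n)
        if p0 >= end:
            return True   # empty range: nothing to remove
        if p0 > 0 or end < n:
            return False  # partial range: not possible with one state
        del self.n_tokens[seq_id]
        return True


def tokens_to_process(cache, seq_id, n_common, prompt_len):
    """Return how many prompt tokens must be (re)processed.

    Tries to keep the first n_common cached tokens; if the cache cannot
    trim the rest, clears the sequence and reprocesses the whole prompt.
    """
    if cache.seq_rm(seq_id, n_common, -1):
        return prompt_len - n_common
    cache.seq_rm(seq_id, 0, -1)  # whole-sequence removal always succeeds
    return prompt_len
```

With 10 cached tokens, a 12-token prompt sharing all 10 only needs 2 new tokens, while a prompt sharing just the first 8 forces all 12 to be reprocessed.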

But I've been thinking of a way to calculate previous states from more recent ones. From equation (2a) of the paper, which looks like what is done in ggml_ssm_scan, the next ssm_states is calculated as follows:

$$h_t = \overline{\boldsymbol{A}} h_{t-1} + \overline{\boldsymbol{B}} x_t$$

Solving for $h_{t-1}$ (with an elementwise division, since $\overline{\boldsymbol{A}}$ is diagonal in Mamba), it should be possible to get the previous ssm_states:

$$h_{t-1} = \frac{h_t - \overline{\boldsymbol{B}} x_t}{\overline{\boldsymbol{A}}}$$

But getting the previous conv_states (which is also necessary to get that $x_t$) requires the fourth last processed token (4 comes from the value of d_conv in the Mamba models). So a list of previous tokens would need to be kept to make the states go further back (still much lighter on memory use than actually keeping the states).

But I'm not sure how to integrate that with the forward pass. Should the roll-back be done at the next llama_decode after llama_kv_cache_seq_rm, or right away? (making llama_kv_cache_seq_rm sometimes slow) This could be handled similarly to KV cache defrag. Where could the token list be stored which won't unnecessarily reserve memory for models which won't need it? What about needing to go further back than the batch size? Which time would "unprocessing" tokens count towards? Is rolling-back the states even possible? (probably not, the z tensor (the right branch of the Mamba architecture in Figure 3 of the Mamba paper) might make this impossible, or at least more complicated than the above)
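The inversion of equation (2a) discussed above can be checked numerically for a single channel. This is only a sketch of the algebra, not of ggml_ssm_scan: it assumes a diagonal discretized A (so everything is elementwise), it assumes that x_t and the per-token discretized A and B are known again at rollback time, and it ignores the conv state and the z branch entirely. The function names are hypothetical.

```python
def ssm_step(h_prev, a_bar, b_bar, x_t):
    """h_t = A_bar * h_{t-1} + B_bar * x_t, elementwise (diagonal A_bar)."""
    return [a * h + b * x_t for a, h, b in zip(a_bar, h_prev, b_bar)]


def ssm_unstep(h_t, a_bar, b_bar, x_t):
    """Invert ssm_step: h_{t-1} = (h_t - B_bar * x_t) / A_bar."""
    return [(h - b * x_t) / a for a, h, b in zip(a_bar, h_t, b_bar)]


# Toy values for one channel with d_state = 2.
a_bar = [0.5, 0.25]
b_bar = [1.0, 2.0]
h_prev = [4.0, 8.0]
x_t = 3.0

h_t = ssm_step(h_prev, a_bar, b_bar, x_t)
print(h_t)
print(ssm_unstep(h_t, a_bar, b_bar, x_t) == h_prev)
```

These toy values are exactly representable in floating point, so the round trip is exact here. In general, dividing by small entries of the discretized A would amplify rounding error, which is one more reason the rollback might not be practical.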

These are questions I'll ponder during the next month (so probably in another PR), after I make the parallel and server examples work out-of-the-box with Mamba in the coming weekend (right now they work, but not as-is).

ggerganov

The implementation is pretty good

I'm still not convinced we need to introduce n_parallel and llama_n_max_seq(). I did some tests using just n_ctx and things seem to work OK. Only the self attention input buffers (such as KQ_mask and KQ_pos) depend on n_ctx (and now kv_size), but these are not used for Mamba, so we won't be over-allocating. If in some places we expect the input to not be bigger than n_ctx (such as the context shift logic), we can try to fix these (simply disable context shift for Mamba models).

Even if the examples with default arguments are not suitable for Mamba (i.e. n_ctx = 512), it's not a big problem. As long as it is just a matter of adjusting some of the CLI args, I think it is good

Either way, we can merge it as it is since the API change is quite small

Otherwise, when the "we have to evaluate at least 1 token" special case was triggered, an extra token was kept in cache_tokens even if it was removed from the KV cache. For Mamba, this caused useless prompt reprocessing when the previous request triggered the above case.

compilade · 2024-03-07T21:52:42Z

The implementation is pretty good

Thanks!

I'm still not convinced we need to introduce n_parallel and llama_n_max_seq().

Imagine the following case: A user wants to use Mamba 3B to process a prompt with a length of... 1337 tokens. This user is only using a single sequence. Out of habit with how other models work, the user passes --ctx-size 2048.

Now, the two ways to do this:

With the number of sequences coming from n_parallel (how it's done in this PR for Mamba)
- A single sequence is allocated (this uses 23.75 MiB of RAM) ✅
- llama_batch_init() creates a buffer big enough for 2048 tokens ✅
- Everything goes well since 1337 tokens can fit in the 2048-tokens buffer. ✅
With the number of sequences coming from n_ctx
- 2048 sequences are allocated (this uses 47.5 GiB of RAM) ❗
- llama_batch_init() creates a buffer big enough for 2048 tokens. ✅
- Everything goes well except if the user doesn't have at least 64 GiB of RAM. 🟡

Okay that was unfair. Let's say the user is better-informed and passes --ctx-size 1 or that n_ctx is somehow 1 (e.g. from different defaults).

With the number of sequences coming from n_parallel or from n_ctx (both are 1 here so the result is the same):
- A single sequence is allocated (this uses 23.75 MiB of RAM) ✅
- llama_batch_init() creates a buffer big enough for 1 token. ❗
- Buffer overflow when the 1337 tokens are added to the 1-token batch buffer. ❌

I don't really see where else than n_ctx the size of the batch buffer (allocated with llama_batch_init()) could come from (especially since this is about the batch buffer before it's split into n_batch-sized parts).

Using n_parallel for the number of sequences was overall easier than trying to change the meaning of n_ctx.

The same reasoning also applies for examples like perplexity and parallel.

I hope this better explains why the context size and the number of sequences were made orthogonal for Mamba.
(note that unlike with Mamba, llama_n_ctx() and llama_n_max_seq() are equivalent for Transformer-based models)

If in some places we expect the input to not be bigger than n_ctx (such as the context shift logic), we can try to fix these (simply disable context shift for Mamba models).

These checks are also used to avoid overflowing the buffer allocated with llama_batch_init().

Currently, context shifting is faked for recurrent models to let n_past be made smaller than n_ctx, while still getting consecutive token positions. (though ideally n_ctx should be big enough for this to never happen, hence the current very big (2**20 (1048576)) default context length stored in the model's metadata for Mamba in convert-hf-to-gguf.py (even though maybe only LongMamba could make use of it))

Either way, we can merge it as it is since the API change is quite small

I agree.
Note that I've adapted my changes for Mamba to the recent refactor of the server example.
The only "weird" things I'm (still) doing there are

using slot.id + 1 as the seq_id of each slot (since the system prompt uses the KV cache's seq_id 0)
- this might be confusing, but at least the external behavior stays the same (i.e. slot id 0 exists)
a little dance with n_parallel to add 1 to it to reserve a sequence id for the system prompt, initialize the model, then remove 1 from n_parallel so that it can be used with the same meaning as before as the number of client slots.
checking for the failure of llama_kv_cache_seq_rm() when removing the part of the cache which is not common with the prompt, because recurrent models (currently) can't have their states partially reset.

compilade · 2024-03-08T01:18:52Z

Since the transformers library is getting support for Mamba (huggingface/transformers#28094), the official Mamba models have been re-released with more metadata.

See https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406

I think I should rename the GGUF key-value pairs I added for Mamba to make them more similar to their transformers counterpart.

| Current GGUF name | `transformers` name | Possible new GGUF name |
| --- | --- | --- |
| `{arch}.ssm.d_conv` | `conv_kernel` | `{arch}.ssm.conv_kernel` |
| `{arch}.ssm.d_state` | `state_size` | `{arch}.ssm.state_size` |
| `{arch}.ssm.d_inner` | `intermediate_size` | `{arch}.ssm.inner_size` |
| `{arch}.ssm.dt_rank` | `time_step_rank` | `{arch}.ssm.time_step_rank` |

This would break existing GGUF-converted Mamba models, though.
(none have been published yet it seems, so those affected should easily be able to reconvert)
If I rename them, it needs to happen before merging.

(EDIT: the above change has been done. If there are any objections, I'd like to know)
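The rename in the table above amounts to a key mapping. A minimal sketch on a plain dict, assuming `{arch}` expands to `mamba`; `rename_ssm_keys` is a hypothetical helper that does not read or write GGUF files, so already-converted models still need to be re-converted.

```python
# Old -> new key suffixes, taken from the table above.
SSM_KEY_RENAMES = {
    "ssm.d_conv": "ssm.conv_kernel",
    "ssm.d_state": "ssm.state_size",
    "ssm.d_inner": "ssm.inner_size",
    "ssm.dt_rank": "ssm.time_step_rank",
}


def rename_ssm_keys(metadata, arch="mamba"):
    """Return a copy of metadata with the old SSM keys renamed.

    The arch prefix "mamba" is an assumption; other keys pass through as-is.
    """
    renames = {f"{arch}.{old}": f"{arch}.{new}" for old, new in SSM_KEY_RENAMES.items()}
    return {renames.get(key, key): value for key, value in metadata.items()}


old = {"general.architecture": "mamba", "mamba.ssm.d_conv": 4, "mamba.ssm.d_state": 16}
print(rename_ssm_keys(old))
```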

For the models available at https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406

This breaks existing converted-to-GGUF models, but the metadata names are more "standard".
* mamba : support mamba-*-hf models
  These models share their token_embd.weight with their output.weight

This is purely a formatting change.

ggerganov · 2024-03-08T09:08:27Z

I don't really see where else than n_ctx the size of the batch buffer (allocated with llama_batch_init()) could come from (especially since this is about the batch buffer before it's split into n_batch-sized parts).

Thanks, I agree now. We should actually start using llama_n_max_seq() instead of n_ctx to init batches in the examples to make it more semantically clear. We can do this in another PR

Feel free to merge this (squash in single commit) when you think it is ready. Maybe add a short notice in the "Recent API changes" section in the README.md to help 3rd party devs and consider updating the GGUF spec with the new keys

Only for Mamba for now, but it might be relevant for other models eventually. Most Mamba models actually share these two tensors, albeit implicitly.

compilade · 2024-03-08T16:00:40Z

We should actually start using llama_n_max_seq() instead of n_ctx to init batches in the examples to make it more semantically clear.

There might be a misunderstanding here. To be clear, llama_n_max_seq() returns the upper limit of acceptable seq_id in batches. This is only relevant when dealing with multiple sequences.

What caused llama_n_max_seq() to exist is the perplexity example, which creates a lot of sequences, especially in the HellaSwag benchmark. I needed this limit to make it avoid using sequence ids that could not fit in Mamba's KV cache.

Unless an example really uses ALL available sequences on any single token in a batch, llama_n_max_seq() should not be used when initializing batches. n_ctx is not replaced by this.

Feel free to merge this (squash in single commit) when you think it is ready.

Noted.

A few tensors were also missing `struct` in front of `ggml_tensor`.

* mamba : begin working on support for Mamba SSM
* mamba : begin figuring out how to (ab)use the kv cache for Mamba
* mamba : recurrent inference almost works, but incoherent
* mamba : recurrent inference WORKS!!!
* convert : optionally use d_conv and d_state from config.json for Mamba
* mamba : refactor recurrent conv, resulting in 20% perf increase
  It's still slower than I'd like, but I did not really optimize `ggml_exp` yet. I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.
* ggml : parallelize ggml_exp
  This results in 8% faster token generation for Mamba-130M.
* mamba : simplify the conv step with a self-overlapping view
  Turns out the conv_state can be made smaller by one column. Note that this breaks existing GGUFs of Mamba, because the key_value_length field is tied to the conv_state size. Convolution with a self-overlapping view is cool! And it's much simpler than what I initially thought would be necessary to make the convolution step work with more than 1 token at a time. Next step is to make the SSM step work on batches of tokens too, and thus I need to figure out a way to make a parallel selective scan which will keep the ssm_state small and won't make it bigger by a factor of (n_layer * batch_size).
* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32
  Relatedly, I also tried to see if other types than f32 worked for the states, but they don't, because of the operators used. It's probably better anyway to keep lots of precision there, since the states are small anyway.
* mamba : fix self-overlapping view depth stride
* mamba : handle batches of more than 1 token
  This means running Mamba no longer crashes when using the default settings! And probably also slightly faster prompt processing. Both batched and non-batched processing yield the same output. Previously, the state was not cleared when starting a sequence. Next step is to make the KV cache API work as expected for Mamba models.
* ggml: add ggml_ssm_scan to help with parallel selective scan If the selective scan was implemented without a custom operator, there would be waaay too many nodes in the graph. For example, for Mamba-130M, with a batch size of 512 (the default), a naive selective scan could add at least 24*512=12288 nodes, which is more than LLAMA_MAX_NODES (8192), and that's only for the smallest Mamba model. So it's much cleaner with a custom operator. Not sure about the name, though. * ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation This will help with performance on CPU if ggml_vec_mul_f32 and ggml_vec_add_f32 are ever optimized with SIMD. * mamba : very basic quantization support Mostly works, but there is currently no difference between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same). Most of the SSM-specific weights can be kept in f32 without affecting the size that much, since they are relatively small. (the linear projection weights are responsible for most of Mamba's size) Too much quantization seems to make the state degrade quite fast, and the model begins to output gibberish. It seems to affect bigger models to a lesser extent than small models, but I'm not sure by how much. Experimentation will be needed to figure out which weights are more important for the _M (and _L?) variants of k-quants for Mamba. * convert : fix wrong name for layer norm weight of offical Mamba models I was using Q-bert/Mamba-* models before, which have a slighlty different naming scheme for the weights. (they start with "model.layers" instead of "backbone.layers") * mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator This increases performance on CPU by around 30% for prompt processing, and by around 20% for text generation. However, it also makes the ggml_exp and ggml_soft_plus operators unused. Whether or not they should be kept will be decided later. 
* convert : for Mamba, also consider the "MambaLMHeadModel" arch name It's the name of the class of the official implementation, though they don't use it (yet) in the "architectures" field of config.json * mamba : fix vocab size problems with official models The perplexity was waaaay to high for models with a non-round vocab size. Not sure why, but it needed to be fixed in the metadata. Note that this breaks existing GGUF-converted Mamba models, but **only if** the vocab size was not already rounded. * ggml : remove ggml_exp and ggml_soft_plus They did not exist anyway outside of this branch, and since ggml_ssm_scan fused operations together, they are unused. It's always possible to bring them back if needed. * mamba : remove some useless comments No code change. * convert : fix flake8 linter errors * mamba : apply suggestions from code review * mamba : remove unecessary branch for row-wise ssm_state and C multiplication It was previously done to avoid permuting when only one token is processed at a time (like when generating text), but permuting is cheap, and dynamically changing the compute graph is not future-proof. * ggml : in ggml_ssm_scan, use more appropriate asserts * ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32 * mamba : multiple sequences, but one at a time This is a step towards making this Mamba implementation usable with the server example (the way the system prompt is kept when clearing the client slots will need to be changed before this can work, though). The KV cache size for this kind of model is tied to the maximum number of sequences kept at any single time. For now, this number is obtained from n_parallel (plus one, to have an extra sequence to dedicate to the system prompt), but there might be a better way to do this which won't also make the main example use 2 cells even if only 1 is really used. 
(for this specific case, --parallel 0 helps) Simultaneous sequence processing will probably require changes to ggml_ssm_scan, and possibly a new operator for the conv step. * mamba : support llama_kv_cache_seq_cp This (mis)uses the logic around K shifts, because tokens in a state can't be shifted anyway, and because inp_K_shift has the right shape and type. Using ggml_get_rows is a nice way to do copies, but copy chains can't work. Fortunately, copy chains don't really seem to be used in the examples. Each KV cell is dedicated to the sequence ID corresponding to its own index. * mamba : use a state mask It's cleaner than the previous heuristic of checking for the pos of the first token in the batch. inp_KQ_mask could not be re-used for this, because it has the wrong shape and because it seems more suited to the next step of simultaneous sequence processing (helping with the problem of remembering which token belongs to which sequence(s)/state(s)). * llama : replace the usage of n_ctx with kv_self.size in many places * mamba : use n_tokens directly instead of n_tok * mamba : in comments, properly refer to KV cells instead of slots * mamba : reduce memory usage of ggml_ssm_scan From 290.37 MiB to 140.68 MiB of CPU compute buffer size with Mamba 3B with a batch size of 512. The result tensor of ggml_ssm_scan was previously a big part of the CPU compute buffer size. To make it smaller, it does not contain the intermediate ssm states anymore. Both y and the last ssm state are combined in the result tensor, because it seems only a single tensor can be returned by an operator with the way the graph is built. * mamba : simultaneous sequence processing A batch can now contain tokens from multiple sequences. This is necessary for at least the parallel example, the server example, and the HellaSwag test in the perplexity example. However, for this to be useful, uses of llama_kv_cache_seq_rm/cp will need to be changed to work on whole sequences. 
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba This operator makes it possible to use and update the correct states for each token of the batch in the same way as ggml_ssm_scan. Other solutions which use existing operators would need loops which would add too many nodes to the graph (at least the ones I thought of). Using this operator further reduces the size of the CPU compute buffer from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512. And (at least on CPU), it's a bit faster than before. Note that "ggml_ssm_conv" is probably not the most appropriate name, and it could be changed if a better one is found. * llama : add inp_s_seq as a new input tensor The most convenient implementation to select the correct state (for Mamba) for each token is to directly get the correct index from a tensor. This is why inp_s_seq is storing int32_t and not floats. The other, less convenient way to select the correct state would be to have inp_KQ_mask contain 1.0f for each state used by a token and 0.0f otherwise. This complicates quickly fetching the first used state of a token, and is also less efficient because a whole row of the mask would always need to be read for each token. Using indexes makes it easy to stop searching when there are no more sequences for a token, and the first sequence assigned is always very quickly available (it's the first element of each row). * mamba : support llama_kv_cache_seq_cp copy chains * mamba : support shifting and dividing the kv cache pos * mamba : make the server and parallel examples work with whole sequences A seq_id is dedicated to the system prompt in both cases. * llama : make llama_kv_cache_seq_rm return whether it succeeded or not * mamba : dedicate an input tensor for state copy indices This is cleaner and makes it easier to adapt when/if token positions (and by extension, inp_K_shift) are no longer integers. 
* mamba : adapt perplexity, batched, and batched-bench examples * perplexity : limit the max number of sequences This adapts to what the loaded model can provide. * llama : add llama_n_max_seq to get the upper limit for seq_ids Used by the perplexity example. * batched : pass n_parallel to the model's context params This should have been there already, but it wasn't. * batched-bench : reserve sequences to support Mamba * batched-bench : fix tokens being put in wrong sequences Generation quality isn't what's measured in there anyway, but at least using the correct sequences avoids using non-consecutive token positions. * mamba : stop abusing attention metadata This breaks existing converted-to-GGUF Mamba models, but will allow supporting mixed architectures like MambaFormer without needing to break Mamba models. This will also allow changing the size of Mamba's states without having to reconvert models in the future. (e.g. using something else than d_conv - 1 columns for the conv_states will not require breaking existing converted Mamba models again) * gguf-py : add new KV metadata key-value pairs for Mamba * llama : add new metadata key-value pairs for Mamba * llama : guard against divisions by zero when n_head is 0 * mamba : rename "unlimited" KV cache property to "recurrent" * mamba : more correctly update the "used" field of the KV cache * ggml : in ggml_ssm_scan, use a threshold for soft_plus This is how the official Mamba implementation does it, and it's also what torch.nn.Softplus does. * convert : for Mamba, fallback to internal NeoX tokenizer The resulting models are exactly the same as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there. 
* mamba : support state saving and restoring * ggml : implicitly pass src tensors through dst for Mamba-related ops * mamba : clarify some comments * server : fix cache_tokens not getting correctly resized Otherwise, when the "we have to evaluate at least 1 token" special case was triggered, an extra token was kept in cache_tokens even if it was removed from the KV cache. For Mamba, this caused useless prompt reprocessing when the previous request triggered the above case. * convert-hf : support new metadata keys for Mamba For the models available at https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406 * mamba : rename metadata to be more similar to transformers library This breaks existing converted-to-GGUF models, but the metadata names are more "standard". * mamba : support mamba-*-hf models These models share their token_embd.weight with their output.weight * mamba : add missing spaces This is purely a formatting change. * convert-hf : omit output.weight when identical with token_embd.weight Only for Mamba for now, but it might be relevant for other models eventually. Most Mamba models actually share these two tensors, albeit implicitly. * readme : add Mamba to supported models, and add recent API changes * mamba : move state_seq and state_mask views outside layer loop A few tensors were also missing `struct` in front of `ggml_tensor`.
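To make the selective-scan notes above more concrete, here is a minimal pure-Python sketch of the kind of recurrence a fused scan operator computes, including a thresholded soft_plus like the one torch.nn.Softplus uses (default threshold 20). It is an illustration under assumptions, not the ggml code: the names `softplus` and `ssm_scan`, the nested-list layout, and the exact fusion order are all hypothetical. It also reproduces the node-count arithmetic quoted in the commit message.

```python
import math

def softplus(x, threshold=20.0):
    # Past the threshold, log(1 + exp(x)) is indistinguishable from x in float32,
    # so x is returned directly (this also avoids overflow in exp).
    return x if x > threshold else math.log1p(math.exp(x))

def ssm_scan(state, xs, dts, A, Bs, Cs):
    """Run the recurrence over a batch of tokens, updating `state` in place.

    state: [d_inner][d_state]   xs, dts: [n_tokens][d_inner]
    A:     [d_inner][d_state]   Bs, Cs:  [n_tokens][d_state]
    Returns ys: [n_tokens][d_inner]; only the last state is kept.
    """
    ys = []
    for x, dt, B, C in zip(xs, dts, Bs, Cs):
        y = []
        for i, row in enumerate(state):
            dt_sp = softplus(dt[i])
            x_dt = x[i] * dt_sp
            for n in range(len(row)):
                # Decay the previous state, then add the new input contribution.
                row[n] = row[n] * math.exp(dt_sp * A[i][n]) + B[n] * x_dt
            # The output for this channel is the state row projected onto C.
            y.append(sum(h * c for h, c in zip(row, C)))
        ys.append(y)
    return ys

# Node-count arithmetic from the commit message: at least one graph node
# per layer per token for a naive (unfused) selective scan.
n_layer, n_batch, max_nodes = 24, 512, 8192
naive_nodes = n_layer * n_batch
```

Because the loop only carries the last state forward, processing a batch in one call or one token at a time gives the same outputs, which matches the "batched and non-batched processing yield the same output" remark.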

gguf-py/gguf/tensor_mapping.py

compilade marked this pull request as draft February 5, 2024 01:09

ggerganov reviewed Feb 5, 2024 (3 comments on ggml.c and 2 on llama.cpp, all now outdated and resolved)

compilade force-pushed the support-mamba-ssm branch from 9c4c257 to 322686e Compare February 14, 2024 20:31

compilade mentioned this pull request Feb 21, 2024

llama : rename n_ctx to kv_size #5568

Closed

compilade force-pushed the support-mamba-ssm branch 3 times, most recently from 3421d17 to 7b1ff55 Compare February 26, 2024 19:18

compilade mentioned this pull request Feb 27, 2024

llama : fix non-quantization of expert gating tensors #5754

Merged

compilade force-pushed the support-mamba-ssm branch 2 times, most recently from 8646535 to fad8848 Compare March 2, 2024 16:52

compilade added 14 commits March 3, 2024 11:28

mamba : begin working on support for Mamba SSM

8cd0a28

mamba : begin figuring out how to (ab)use the kv cache for Mamba

5a69a26

mamba : recurrent inference almost works, but incoherent

f680364

mamba : recurrent inference WORKS!!!

54d3e48

convert : optionally use d_conv and d_state from config.json for Mamba

74eea85

mamba : refactor recurrent conv, resulting in 20% perf increase

9e77061

It's still slower than I'd like, but I did not really optimize `ggml_exp` yet. I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.

ggml : parallelize ggml_exp

3f7233b

This results in 8% faster token generation for Mamba-130M.

mamba : fix self-overlapping view depth stride

81b57bb
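As a rough illustration of what the conv step computes, independent of how the self-overlapping view is laid out in memory, here is a pure-Python sketch of a per-channel rolling convolution whose state holds d_conv - 1 columns (the size mentioned in the squashed commit message). The function name `ssm_conv_step` and the list layout are hypothetical, and the bias and activation that follow the convolution in Mamba are left out.

```python
def ssm_conv_step(conv_state, xs, weight):
    """Causal depthwise convolution over a batch of tokens.

    conv_state: [d_inner][d_conv - 1], the last inputs seen per channel
    xs:         [n_tokens][d_inner]
    weight:     [d_inner][d_conv]
    Returns outs: [n_tokens][d_inner] and updates conv_state in place.
    """
    outs = []
    for x in xs:
        out = []
        for i, row in enumerate(conv_state):
            # The window is the stored d_conv - 1 columns plus the new input.
            window = row + [x[i]]
            out.append(sum(w * v for w, v in zip(weight[i], window)))
            # Slide the window: drop the oldest column, keep the newest.
            conv_state[i] = window[1:]
        outs.append(out)
    return outs
```

With d_conv = 4, each channel keeps only 3 past values, and the window for the next token overlaps the previous one by those 3 columns.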

ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation

78a853b

This will help with performance on CPU if ggml_vec_mul_f32 and ggml_vec_add_f32 are ever optimized with SIMD.

convert : for Mamba, also consider the "MambaLMHeadModel" arch name

9f55809

It's the name of the class of the official implementation, though they don't use it (yet) in the "architectures" field of config.json
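A converter can accept several architecture names for the same model family. The sketch below shows one way to do that check. It is not the code from the convert script: `MAMBA_ARCH_NAMES` and `is_mamba_config` are hypothetical names, and listing "MambaForCausalLM" (the transformers class name) alongside "MambaLMHeadModel" is an assumption.

```python
# Architecture names treated as Mamba. "MambaLMHeadModel" is the class name of
# the official implementation; "MambaForCausalLM" is assumed here for the
# transformers-compatible checkpoints.
MAMBA_ARCH_NAMES = ("MambaForCausalLM", "MambaLMHeadModel")

def is_mamba_config(config: dict) -> bool:
    # "architectures" may be missing or null in config.json, so default to [].
    archs = config.get("architectures") or []
    return any(name in MAMBA_ARCH_NAMES for name in archs)
```

Since the official config.json may not list any architecture at all, a real converter would still need some other fallback for that case.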

ggerganov approved these changes Mar 7, 2024 (1 resolved comment on llama.cpp)

compilade added 2 commits March 7, 2024 14:15

Merge branch 'master' into support-mamba-ssm

916b586

compilade added 3 commits March 7, 2024 20:28

convert-hf : support new metadata keys for Mamba

d8024a4

For the models available at https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406

mamba : rename metadata to be more similar to transformers library

17e4d6c

This breaks existing converted-to-GGUF models, but the metadata names are more "standard".

mamba : support mamba-*-hf models

These models share their token_embd.weight with their output.weight

mamba : add missing spaces

1c8ea55

This is purely a formatting change.

convert-hf : omit output.weight when identical with token_embd.weight

d0d32dc

Only for Mamba for now, but it might be relevant for other models eventually. Most Mamba models actually share these two tensors, albeit implicitly.
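The idea of omitting a tied output tensor can be sketched as follows. This is a minimal illustration, assuming tensors are held in a plain dict keyed by their GGUF names and compared by value. The helpers `drop_tied_output` and `resolve_output` are hypothetical, not functions from the convert script or from llama.cpp.

```python
def drop_tied_output(tensors: dict) -> dict:
    """Return a copy of `tensors` without output.weight when it duplicates token_embd.weight."""
    out = tensors.get("output.weight")
    emb = tensors.get("token_embd.weight")
    if out is not None and emb is not None and out == emb:
        return {name: t for name, t in tensors.items() if name != "output.weight"}
    return dict(tensors)

def resolve_output(tensors: dict):
    """Loader-side counterpart: fall back to the embedding when output.weight is absent."""
    if "output.weight" in tensors:
        return tensors["output.weight"]
    return tensors["token_embd.weight"]
```

Writing the shared tensor once keeps the converted file smaller, at the cost of the loader having to know about the fallback.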

compilade added 2 commits March 8, 2024 11:03

readme : add Mamba to supported models, and add recent API changes

3e5685f

mamba : move state_seq and state_mask views outside layer loop

39579d3

A few tensors were also missing `struct` in front of `ggml_tensor`.

compilade merged commit c2101a2 into ggerganov:master Mar 8, 2024
61 checks passed

compilade mentioned this pull request Mar 9, 2024

perplexity : support using multiple sequences to allow larger batch sizes #5946

Merged

MarcellM01 mentioned this pull request Mar 9, 2024

Mamba State Space Models Integration ollama/ollama#3023

Open

RookieIndieDev mentioned this pull request Mar 11, 2024

Possibility of using Mamba SSM Mobile-Artificial-Intelligence/maid#390

Closed

compilade mentioned this pull request Mar 12, 2024

gguf : add Mamba keys and tensors ggerganov/ggml#763

Merged

johnnynunez mentioned this pull request Mar 23, 2024

Mamba (State Spaces Models) dusty-nv/jetson-containers#447

Closed

maziyarpanahi mentioned this pull request Mar 29, 2024

Suport for Jamba JambaForCausalLM #6372

Open

4 tasks

cold-blue reviewed Apr 8, 2024 (1 resolved comment on gguf-py/gguf/tensor_mapping.py)

ggerganov mentioned this pull request Apr 8, 2024

llama : fix attention layer count sanity check #6550

Merged

ggerganov mentioned this pull request Apr 19, 2024

ggml : add GPU support for Mamba models #6758

Open

JoaoVictorVP mentioned this pull request Apr 25, 2024

Mamba SciSharp/LLamaSharp#694

Closed

compilade mentioned this pull request May 26, 2024

llama : support Jamba hybrid Transformer-Mamba models #7531

Draft

17 tasks

manny-pi mentioned this pull request Sep 27, 2024

Error: llama_model_load: error loading model: failed to open ggml-bagel-2.8b-v0.2-q8_0.gguf #9656

Closed
