
DRAFT: PR to check diff between mamba2-sync and Falcon-H1 #1


Draft
wants to merge 68 commits into
base: mamba2-sync

Conversation

younesbelkada

No description provided.

gabe-l-hart and others added 30 commits June 12, 2025 13:54
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in
llama-arch with llama_model_is_recurrent delegating to
llm_arch_is_recurrent. The same split is done for hybrid. This is needed
because there are places where the llama_model has not yet been initialized
but we need to check if the model is recurrent (specifically for the
per-layer recurrent check array in hparams).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
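Roughly, the split described above has the following shape (a minimal sketch, not the exact patch; the list of architectures in the switch is illustrative):

bool llm_arch_is_recurrent(const llm_arch arch) {
    // llama-arch: decide purely from the architecture enum, so this can be
    // used before a llama_model has been constructed
    switch (arch) {
        case LLM_ARCH_MAMBA:
        case LLM_ARCH_MAMBA2:
        case LLM_ARCH_RWKV6:
            return true;
        default:
            return false;
    }
}

bool llama_model_is_recurrent(const llama_model * model) {
    // llama-model: the existing public predicate now just delegates
    return llm_arch_is_recurrent(model->arch);
}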
…s in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…l is recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The implementation of the hybrid cache intentionally does not specify the
types of the child caches, so there was a naming mismatch with these
predicate functions that used "hybrid" to imply "hybrid recurrent."

Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This follows the pattern in iswa where the two child caches are held
explicitly, to support the case where a model requires a single attention
cache and a single recurrent cache, with each layer using exactly one of the
two.

This is a rewrite of the more generic approach in the original hybrid cache
PR: ggml-org#13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
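In code, the shape described above is roughly the following (member and accessor names are assumed, mirroring the iswa pattern; the llama.cpp cache types are taken to be in scope):

class llama_kv_cache_hybrid_recurrent {
    // each layer is routed to exactly one of these two children
    std::unique_ptr<llama_kv_cache_unified>   kv_attn;      // attention layers
    std::unique_ptr<llama_kv_cache_recurrent> kv_recurrent; // recurrent layers

public:
    llama_kv_cache_unified   * get_kv_attn()      const { return kv_attn.get(); }
    llama_kv_cache_recurrent * get_kv_recurrent() const { return kv_recurrent.get(); }
};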
This includes a refactor of the create_memory logic so that the arch enum is
only consulted when a model needs cache instantiation logic beyond the
standard handling for recurrent, hybrid, unified, and iswa.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
NOTE: I intentionally did not add support for s_mask since it will be going
away soon

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…he interface

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…empt

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
No longer needed now that unified isn't also supporting recurrent

ggml-org#13979 (comment)

Branch: HybridRecurrentCache
Now that it's not used at all in the unified cache, we don't need to use
the layer index to zero it out for attention layers.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is no longer needed now that there are separate implementations

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…y state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ntion pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-
layer components are created for attention layers and recurrent layers. The
main changes are:

- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/
  llm_graph_input_rs * as the first input
- Add a corresponding overload of build_rs w/
  llm_graph_input_rs_hybrid_recurrent * as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to
  llm_graph_input_attn_kv_unified
- Add a build_attn override that takes
  llm_graph_input_attn_kv_hybrid_recurrent * as the first input

This makes the two paradigms fully consistent. The main drawback is the
code duplication in the build_attn and build_rs implementations where the
only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
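Condensed, the resulting graph-builder surface described in the commit above looks roughly like this (declarations only, parameter lists trimmed; not the exact upstream signatures):

// recurrent-state inputs, one per memory type
class llm_graph_input_rs;                   // renamed from llm_graph_input_s_copy
class llm_graph_input_rs_hybrid_recurrent;  // hybrid counterpart

llm_graph_input_rs                  * build_rs_inp_recurrent();
llm_graph_input_rs_hybrid_recurrent * build_rs_inp_hybrid_recurrent();

// build_rs mirrors build_attn: the type of the first argument selects
// which memory state the implementation casts to internally
ggml_tensor * build_rs  (llm_graph_input_rs                       * inp /*, ... */);
ggml_tensor * build_rs  (llm_graph_input_rs_hybrid_recurrent      * inp /*, ... */);
ggml_tensor * build_attn(llm_graph_input_attn_kv_unified          * inp /*, ... */);
ggml_tensor * build_attn(llm_graph_input_attn_kv_hybrid_recurrent * inp /*, ... */);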
* origin/compilade/mamba2: (27 commits)
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
ggml : fix mamba2 ssm scan when compiled with SVE
graph : fix recurrent state copies when avoiding copies
kv-cache : allow context shift for recurrent models
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : remove const_cast when setting inputs for s_copy
metal : single-user mamba2 inference works
metal : add missing args for nb references in ssm_scan_f32_group
metal : fix confusion between ; and ,
convert : fix flake8 lint
ggml : avoid multiply by D in GGML_OP_SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
metal : fix wrong number of tokens per sequence in SSM_SCAN
metal : fix SSM_SCAN state head offset
metal : add back n_seqs to SSM_SCAN args
metal : remove unused arguments for SSM_SCAN
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : fix SSM_SCAN pipeline scope
metal : attempt to adapt SSM_SCAN for Mamba-2
llama : avoid redundant state copy for Mamba 1 and 2
...
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in
llama-arch with llama_model_is_recurrent delegating to
llm_arch_is_recurrent. The same split is done for hybrid. This is needed
because there are places where the llama_model has not yet been initialized
but we need to check if the model is recurrent (specifically for the
per-layer recurrent check array in hparams).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
gabe-l-hart and others added 18 commits June 16, 2025 15:18
This is no longer needed now that there are separate implementations

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…y state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ntion pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-
layer components are created for attention layers and recurrent layers. The
main changes are:

- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/
  llm_graph_input_rs * as the first input
- Add a corresponding overload of build_rs w/
  llm_graph_input_rs_hybrid_recurrent * as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to
  llm_graph_input_attn_kv_unified
- Add a build_attn override that takes
  llm_graph_input_attn_kv_hybrid_recurrent * as the first input

This makes the two paradigms fully consistent. The main drawback is the
code duplication in the build_attn and build_rs implementations where the
only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Since initially writing this PR, the logic in the child state types changed
such that using the "init full" signature and keeping the ubatches on the
parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ostic

This reduces the code duplication between the different build_rs impls and
also retains a similar signature to the previous build_recurrent_state
method while standardizing on the input-dispatched build_rs implementation.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
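A rough sketch of the shared core described above (member, helper, and parameter names here are assumptions, not the actual patch):

ggml_tensor * llm_graph_context::build_rs(llm_graph_input_rs * inp, ggml_cgraph * gf, ggml_tensor * s) const {
    // plain recurrent model: the memory state is the recurrent cache state itself
    const auto * state = static_cast<const llama_kv_cache_recurrent_state *>(mstate);
    return build_rs_impl(gf, inp->s_copy, s, state);   // shared, state-agnostic core
}

ggml_tensor * llm_graph_context::build_rs(llm_graph_input_rs_hybrid_recurrent * inp, ggml_cgraph * gf, ggml_tensor * s) const {
    // hybrid model: pull the recurrent child state out of the hybrid state first
    const auto * state = static_cast<const llama_kv_cache_hybrid_recurrent_state *>(mstate)->get_state_recurrent();
    return build_rs_impl(gf, inp->s_copy, s, state);   // same shared core
}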
* origin/compilade/mamba2: (27 commits)
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
ggml : fix mamba2 ssm scan when compiled with SVE
graph : fix recurrent state copies when avoiding copies
kv-cache : allow context shift for recurrent models
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : remove const_cast when setting inputs for s_copy
metal : single-user mamba2 inference works
metal : add missing args for nb references in ssm_scan_f32_group
metal : fix confusion between ; and ,
convert : fix flake8 lint
ggml : avoid multiply by D in GGML_OP_SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
metal : fix wrong number of tokens per sequence in SSM_SCAN
metal : fix SSM_SCAN state head offset
metal : add back n_seqs to SSM_SCAN args
metal : remove unused arguments for SSM_SCAN
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : fix SSM_SCAN pipeline scope
metal : attempt to adapt SSM_SCAN for Mamba-2
llama : avoid redundant state copy for Mamba 1 and 2
...
@@ -32,7 +32,7 @@ llama_kv_cache_hybrid_recurrent::llama_kv_cache_hybrid_recurrent(
     kv_attn(new llama_kv_cache_unified(
         model,
         attn_filter == nullptr ?
-            [&](int32_t il) { return !model.hparams.recurrent_layer(il); }
+            [&](int32_t il) { return  model.hparams.recurrent_layer(il); }
Author

The only open question I have on my end is how we can refactor this so that it can also accept models where we have both attn and ssm in the same layers - do you have any idea how we can tackle this @gabe-l-hart? 👀

Owner

I think this should already be doable, but it will need to be done in llama_model::create_memory. My intent with the refactor starting here is that there should be default behavior for models based on llm_arch_is_recurrent and llm_arch_is_hybrid_recurrent, but that individual models that need special initialization can add themselves to the switch statement before the default and initialize the correct kind of memory for themselves (although, it would be even nicer if there were simply a way to override create_memory for a given model type's builder class).

For H1, I think you would essentially copy the clause used to initialize the hybrid recurrent cache (here) and use the option to add overriding filters to make "always on" filters for both child cache types.
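Concretely, that might look something like the following (a hypothetical sketch of a create_memory clause; constructor arguments other than the filters are elided and the real signature may differ):

case LLM_ARCH_FALCON_H1:
    {
        // every layer carries both an attention slot and a recurrent slot,
        // so both child-cache filters are "always on"
        res = new llama_kv_cache_hybrid_recurrent(
            *this,
            /* attn_filter      */ [&](int32_t /* il */) { return true; },
            /* recurrent_filter */ [&](int32_t /* il */) { return true; }
            /* , remaining cache params elided ... */);
    } break;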

Author

Thank you for the pointer @gabe-l-hart! Will look into that.

    std::fill(
        hparams.recurrent_layer_arr.begin(),
        hparams.recurrent_layer_arr.end(),
-       llm_arch_is_recurrent(ml.get_arch()));
+       true);
Author

Also, this is hard-coded here; we'll need to change it.

Owner

My goal with hard-coding this here was to make this a "sane default" for models that don't have specific override logic. For Granite 4, I overwrite this here when parsing hparams.
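For illustration, a per-layer override while parsing hparams could look like this (the arch case and the layer schedule below are made up, not Granite 4's actual pattern):

case LLM_ARCH_SOME_HYBRID:   // hypothetical arch
    {
        for (uint32_t il = 0; il < hparams.n_layer; ++il) {
            // e.g. every 4th layer is attention-only, the rest are recurrent
            hparams.recurrent_layer_arr[il] = (il % 4 != 3);
        }
    } break;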

@@ -604,6 +604,11 @@ ggml_tensor * llm_graph_context::build_ffn(
         case LLM_FFN_PAR:
             {
                 cur = build_lora_mm(gate, cur);

+                if (arch == LLM_ARCH_FALCON_H1) {
+                    cur = ggml_scale(ctx0, cur, hparams.mlp_gate_multiplier);
Owner

Depending on what the core maintainers say about the addition of these new hparams for H1, I think the best approach here would be to simply check whether hparams.mlp_gate_multiplier is 0 rather than doing an architecture check. In general, the repo is trying to consolidate the number of places that individual architecture enums are checked, in order to slowly encapsulate the logic for a given architecture into a self-contained class (this is speculation, but it looks like the direction things are headed).
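That alternative would look something like this (a sketch, assuming 0.0f means "unset" as suggested):

cur = build_lora_mm(gate, cur);

// apply the gate multiplier only when the hparam is actually set,
// instead of switching on LLM_ARCH_FALCON_H1
if (hparams.mlp_gate_multiplier != 0.0f) {
    cur = ggml_scale(ctx0, cur, hparams.mlp_gate_multiplier);
}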

Author

Makes sense, will do that

    uint32_t ssm_head_dim    = 0;
    uint32_t ssm_mamba_d_ssm = 0;

    // Falcon-H1 specific parameters
Owner

This is definitely interesting. My guess is that there will be some pushback on adding this many new hparams to the central struct that are specific to a single architecture. Many of these appear to have analogous params elsewhere in the struct, so the general preference is to reuse existing params (even if their names don't match the HF names) and then do the name translation in the convert_hf_to_gguf.py / gguf-py portion.

        ggml_build_forward_expand(gf, cur);
    }

    ggml_tensor * build_mamba2_layer(
Owner

I'm in the same boat for Granite 4 where I've copy-pasted this build_mamba2_layer method. I've tried several other ways of making things more composable, but they each have their downsides:

  1. Making build_mamba_model::build_mamba*_layer methods static and then passing this when calling them. This is nice because other model builders can call them by simply calling build_mamba_model::build_mamba2_layer(this, ...), but it makes the method signature gross and causes the "child model type" to know about the "parent model type", which is a bit of an inversion of how the logical composition should go.
  2. Create a special base class that doesn't inherit from llm_graph_context as a mixin (llm_build_mamba_mixin) where it is initialized with a llm_graph_context * called self and then uses that in the implementations of the layer builders. This is nicer when using them from builders that need them because you simply have to inherit from the mixin class, initialize it with this, then call the method when needed like normal. The downside is that the mixin pattern isn't used anywhere else in the repo, so it would be a net-new design pattern that might add complexity.

I also tried direct diamond inheritance (llm_build_hybrid_mamba inherits from both llm_build_mamba and llm_build_granite which both inherit from llm_graph_context), but since the builder classes all implement their graph initialization in their constructors, this failed badly due to initializing all the models on the same graph.

At the moment, because of the complexity of these composition solutions, I'm sticking with the "copy-paste stuff" approach, but I'm going to keep poking at it to see if there's a composition approach that will actually feel clean for hybrid models like this that borrow layer builders from other models.
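For reference, the mixin shape from option 2 above would look roughly like this (schematic only; constructor parameters and the layer-builder body are elided):

struct llm_build_mamba_mixin {
    llm_graph_context * self;   // the borrowed graph context

    explicit llm_build_mamba_mixin(llm_graph_context * self) : self(self) {}

    // implemented in terms of self->build_lora_mm(...), self->ctx0, etc.
    ggml_tensor * build_mamba2_layer(/* inp, ubatch, il, ... */);
};

// a hybrid builder inherits the mixin alongside its normal base, initializes
// it with `this`, and then calls build_mamba2_layer() as usual
struct llm_build_hybrid_mamba : llm_graph_context, llm_build_mamba_mixin {
    llm_build_hybrid_mamba(/* params */)
        : llm_graph_context(/* params */), llm_build_mamba_mixin(this) {
        // attention layers built here; SSM layers via build_mamba2_layer(...)
    }
};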

Author
@younesbelkada Jun 19, 2025

Yes, I also faced this!
For H1 it's a bit more complex because we use d_ssm from the config / hparams whereas other models use d_inner. Also, in our case we have this additional mup_vector that we use right after the ssm_proj projection, so I decided to stick with the copy-paste approach.

Owner

That makes sense. I think it's a good argument in favor of copy-paste: Even if a lot of the logic is the same, it gives each model full ownership over exactly how that logic is wired up and avoids the challenges when a "parent model" makes changes that accidentally break a "child model" that depends on it. This is very similar to how transformers does it, though they've swung back a bit with the modular approach. The more I think about it, though, the more I think I'll stick with this for Granite 4 as well.

@younesbelkada
Author

Thank you @gabe-l-hart for your review! I also noticed that the hybrid cache PR finally got merged. Before addressing your comments, I was thinking of merging my branch with the current llama.cpp master first - what do you think?

@younesbelkada
Author

It seems like Mamba2 might be close to landing: ggml-org#9126 (comment). I might also wait for that PR to be merged into master and then re-merge this branch with master.

@gabe-l-hart
Owner

Yep, that makes total sense. I'm going to do the same with Granite 4 and then hopefully, finally, point it at master!

@gabe-l-hart gabe-l-hart force-pushed the mamba2-sync branch 2 times, most recently from b605bb9 to fdc9a8d on June 27, 2025 17:52