
DRAFT: PR to check diff between mamba2-sync and Falcon-H1 #1


Draft
wants to merge 68 commits into
base: mamba2-sync

Conversation

younesbelkada

No description provided.

gabe-l-hart and others added 30 commits June 12, 2025 13:54
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in
llama-arch with llama_model_is_recurrent delegating to
llm_arch_is_recurrent. The same split is done for hybrid. This is needed
because there are places where the llama_model has not yet been initialized
but we need to check if the model is recurrent (specifically for the
per-layer recurrent check array in hparams).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
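Roughly, the split described above has the following shape (a minimal sketch, not the exact patch; the list of architectures in the switch is illustrative):

bool llm_arch_is_recurrent(const llm_arch arch) {
    // llama-arch: decide purely from the architecture enum, so this can be
    // used before a llama_model has been constructed
    switch (arch) {
        case LLM_ARCH_MAMBA:
        case LLM_ARCH_MAMBA2:
        case LLM_ARCH_RWKV6:
            return true;
        default:
            return false;
    }
}

bool llama_model_is_recurrent(const llama_model * model) {
    // llama-model: the existing public predicate now just delegates
    return llm_arch_is_recurrent(model->arch);
}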
…s in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…l is recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The implementation of the hybrid cache intentionally does not specify the
types of the child caches, so there was a naming mismatch with these
predicate functions that used "hybrid" to imply "hybrid recurrent."

Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This follows the pattern in iswa where the two child caches are held
explicitly, to support the case where a model requires a single attention
cache and a single recurrent cache, with each layer using exactly one of the
two.

This is a rewrite of the more generic approach in the original hybrid cache
PR: ggml-org#13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
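In code, the shape described above is roughly the following (member and accessor names are assumed, mirroring the iswa pattern; the llama.cpp cache types are taken to be in scope):

class llama_kv_cache_hybrid_recurrent {
    // each layer is routed to exactly one of these two children
    std::unique_ptr<llama_kv_cache_unified>   kv_attn;      // attention layers
    std::unique_ptr<llama_kv_cache_recurrent> kv_recurrent; // recurrent layers

public:
    llama_kv_cache_unified   * get_kv_attn()      const { return kv_attn.get(); }
    llama_kv_cache_recurrent * get_kv_recurrent() const { return kv_recurrent.get(); }
};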
This includes a refactor of the create_memory logic so that the arch enum is
only consulted when a model needs cache instantiation logic beyond the
standard handling for recurrent, hybrid, unified, and iswa.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
NOTE: I intentionally did not add support for s_mask since it will be going
away soon

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…he interface

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…empt

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
No longer needed now that unified isn't also supporting recurrent

ggml-org#13979 (comment)

Branch: HybridRecurrentCache
Now that it's not used at all in the unified cache, we don't need to use
the layer index to zero it out for attention layers.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is no longer needed now that there are separate implementations

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…y state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ntion pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-
layer components are created for attention layers and recurrent layers. The
main changes are:

- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/
  llm_graph_input_rs * as the first input
- Add a corresponding overload of build_rs w/
  llm_graph_input_rs_hybrid_recurrent * as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to
  llm_graph_input_attn_kv_unified
- Add a build_attn override that takes
  llm_graph_input_attn_kv_hybrid_recurrent * as the first input

This makes the two paradigms fully consistent. The main drawback is the
code duplication in the build_attn and build_rs implementations where the
only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
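Condensed, the resulting graph-builder surface described in the commit above looks roughly like this (declarations only, parameter lists trimmed; not the exact upstream signatures):

// recurrent-state inputs, one per memory type
class llm_graph_input_rs;                   // renamed from llm_graph_input_s_copy
class llm_graph_input_rs_hybrid_recurrent;  // hybrid counterpart

llm_graph_input_rs                  * build_rs_inp_recurrent();
llm_graph_input_rs_hybrid_recurrent * build_rs_inp_hybrid_recurrent();

// build_rs mirrors build_attn: the type of the first argument selects
// which memory state the implementation casts to internally
ggml_tensor * build_rs  (llm_graph_input_rs                       * inp /*, ... */);
ggml_tensor * build_rs  (llm_graph_input_rs_hybrid_recurrent      * inp /*, ... */);
ggml_tensor * build_attn(llm_graph_input_attn_kv_unified          * inp /*, ... */);
ggml_tensor * build_attn(llm_graph_input_attn_kv_hybrid_recurrent * inp /*, ... */);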
* origin/compilade/mamba2: (27 commits)
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
ggml : fix mamba2 ssm scan when compiled with SVE
graph : fix recurrent state copies when avoiding copies
kv-cache : allow context shift for recurrent models
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : remove const_cast when setting inputs for s_copy
metal : single-user mamba2 inference works
metal : add missing args for nb references in ssm_scan_f32_group
metal : fix confusion between ; and ,
convert : fix flake8 lint
ggml : avoid multiply by D in GGML_OP_SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
metal : fix wrong number of tokens per sequence in SSM_SCAN
metal : fix SSM_SCAN state head offset
metal : add back n_seqs to SSM_SCAN args
metal : remove unused arguments for SSM_SCAN
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : fix SSM_SCAN pipeline scope
metal : attempt to adapt SSM_SCAN for Mamba-2
llama : avoid redundant state copy for Mamba 1 and 2
...
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in
llama-arch with llama_model_is_recurrent delegating to
llm_arch_is_recurrent. The same split is done for hybrid. This is needed
because there are places where the llama_model has not yet been initialized
but we need to check if the model is recurrent (specifically for the
per-layer recurrent check array in hparams).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
gabe-l-hart and others added 18 commits June 16, 2025 15:18
This is no longer needed now that there are separate implementations

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.

ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
ggml-org#13979 (comment)

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…y state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ntion pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-
layer components are created for attention layers and recurrent layers. The
main changes are:

- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/
  llm_graph_input_rs * as the first input
- Add a corresponding overload of build_rs w/
  llm_graph_input_rs_hybrid_recurrent * as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to
  llm_graph_input_attn_kv_unified
- Add a build_attn override that takes
  llm_graph_input_attn_kv_hybrid_recurrent * as the first input

This makes the two paradigms fully consistent. The main drawback is the
code duplication in the build_attn and build_rs implementations where the
only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Since initially writing this PR, the logic in the child state types changed
such that using the "init full" signature and keeping the ubatches on the
parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ostic

This reduces the code duplication between the different build_rs impls and
also retains a similar signature to the previous build_recurrent_state
method while standardizing on the input-dispatched build_rs implementation.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
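A rough sketch of the shared core described above (member, helper, and parameter names here are assumptions, not the actual patch):

ggml_tensor * llm_graph_context::build_rs(llm_graph_input_rs * inp, ggml_cgraph * gf, ggml_tensor * s) const {
    // plain recurrent model: the memory state is the recurrent cache state itself
    const auto * state = static_cast<const llama_kv_cache_recurrent_state *>(mstate);
    return build_rs_impl(gf, inp->s_copy, s, state);   // shared, state-agnostic core
}

ggml_tensor * llm_graph_context::build_rs(llm_graph_input_rs_hybrid_recurrent * inp, ggml_cgraph * gf, ggml_tensor * s) const {
    // hybrid model: pull the recurrent child state out of the hybrid state first
    const auto * state = static_cast<const llama_kv_cache_hybrid_recurrent_state *>(mstate)->get_state_recurrent();
    return build_rs_impl(gf, inp->s_copy, s, state);   // same shared core
}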
* origin/compilade/mamba2: (27 commits)
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
ggml : fix mamba2 ssm scan when compiled with SVE
graph : fix recurrent state copies when avoiding copies
kv-cache : allow context shift for recurrent models
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : remove const_cast when setting inputs for s_copy
metal : single-user mamba2 inference works
metal : add missing args for nb references in ssm_scan_f32_group
metal : fix confusion between ; and ,
convert : fix flake8 lint
ggml : avoid multiply by D in GGML_OP_SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
metal : fix wrong number of tokens per sequence in SSM_SCAN
metal : fix SSM_SCAN state head offset
metal : add back n_seqs to SSM_SCAN args
metal : remove unused arguments for SSM_SCAN
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : fix SSM_SCAN pipeline scope
metal : attempt to adapt SSM_SCAN for Mamba-2
llama : avoid redundant state copy for Mamba 1 and 2
...
@@ -32,7 +32,7 @@ llama_kv_cache_hybrid_recurrent::llama_kv_cache_hybrid_recurrent(
     kv_attn(new llama_kv_cache_unified(
         model,
         attn_filter == nullptr ?
-            [&](int32_t il) { return !model.hparams.recurrent_layer(il); }
+            [&](int32_t il) { return  model.hparams.recurrent_layer(il); }
Author

The only open question I have on my end is how we can refactor this so that it can also accept models where we have both attn and ssm in the same layers - do you have any idea how we can tackle this @gabe-l-hart? 👀

Owner

I think this should already be doable, but it will need to be done in llama_model::create_memory. My intent with the refactor starting here is that there should be default behavior for models based on llm_arch_is_recurrent and llm_arch_is_hybrid_recurrent, but that individual models that need special initialization can add themselves to the switch statement before the default and initialize the correct kind of memory for themselves (although, it would be even nicer if there were simply a way to override create_memory for a given model type's builder class).

For H1, I think you would essentially copy the clause used to initialize the hybrid recurrent cache (here) and use the option to add overriding filters to make "always on" filters for both child cache types.
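Concretely, that might look something like the following (a hypothetical sketch of a create_memory clause; constructor arguments other than the filters are elided and the real signature may differ):

case LLM_ARCH_FALCON_H1:
    {
        // every layer carries both an attention slot and a recurrent slot,
        // so both child-cache filters are "always on"
        res = new llama_kv_cache_hybrid_recurrent(
            *this,
            /* attn_filter      */ [&](int32_t /* il */) { return true; },
            /* recurrent_filter */ [&](int32_t /* il */) { return true; }
            /* , remaining cache params elided ... */);
    } break;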

Author

Thank you for the pointer @gabe-l-hart! Will look into that.

    std::fill(
        hparams.recurrent_layer_arr.begin(),
        hparams.recurrent_layer_arr.end(),
-       llm_arch_is_recurrent(ml.get_arch()));
+       true);
Author

Also, this is hard-coded here; we'll need to change it.

Owner

My goal with hard-coding this here was to make this a "sane default" for models that don't have specific override logic. For Granite 4, I overwrite this here when parsing hparams.
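For illustration, a per-layer override while parsing hparams could look like this (the arch case and the layer schedule below are made up, not Granite 4's actual pattern):

case LLM_ARCH_SOME_HYBRID:   // hypothetical arch
    {
        for (uint32_t il = 0; il < hparams.n_layer; ++il) {
            // e.g. every 4th layer is attention-only, the rest are recurrent
            hparams.recurrent_layer_arr[il] = (il % 4 != 3);
        }
    } break;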

@@ -604,6 +604,11 @@ ggml_tensor * llm_graph_context::build_ffn(
         case LLM_FFN_PAR:
             {
                 cur = build_lora_mm(gate, cur);

+                if (arch == LLM_ARCH_FALCON_H1) {
+                    cur = ggml_scale(ctx0, cur, hparams.mlp_gate_multiplier);
Owner

Depending on what the core maintainers say about the addition of these new hparams for H1, I think the best approach here would be to simply check whether hparams.mlp_gate_multiplier is 0 rather than doing an architecture check. In general, the repo is trying to consolidate the number of places that individual architecture enums are checked, in order to slowly encapsulate the logic for a given architecture into a self-contained class (this is speculation, but it looks like the direction things are headed).
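That alternative would look something like this (a sketch, assuming 0.0f means "unset" as suggested):

cur = build_lora_mm(gate, cur);

// apply the gate multiplier only when the hparam is actually set,
// instead of switching on LLM_ARCH_FALCON_H1
if (hparams.mlp_gate_multiplier != 0.0f) {
    cur = ggml_scale(ctx0, cur, hparams.mlp_gate_multiplier);
}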

Author

Makes sense, will do that

    uint32_t ssm_head_dim    = 0;
    uint32_t ssm_mamba_d_ssm = 0;

    // Falcon-H1 specific parameters
Owner

This is definitely interesting. My guess is that there will be some pushback on adding this many new hparams to the central struct that are specific to a single architecture. Many of these appear to have analogous params elsewhere in the struct, so the general preference is to reuse existing params (even if their names don't match the HF names) and then do the name translation in the convert_hf_to_gguf.py / gguf-py portion.

        ggml_build_forward_expand(gf, cur);
    }

    ggml_tensor * build_mamba2_layer(
Owner

I'm in the same boat for Granite 4 where I've copy-pasted this build_mamba2_layer method. I've tried several other ways of making things more composable, but they each have their downsides:

  1. Making build_mamba_model::build_mamba*_layer methods static and then passing this when calling them. This is nice because other model builders can call them by simply calling build_mamba_model::build_mamba2_layer(this, ...), but it makes the method signature gross and causes the "child model type" to know about the "parent model type", which is a bit of an inversion of how the logical composition should go.
  2. Create a special base class that doesn't inherit from llm_graph_context as a mixin (llm_build_mamba_mixin) where it is initialized with a llm_graph_context * called self and then uses that in the implementations of the layer builders. This is nicer when using them from builders that need them because you simply have to inherit from the mixin class, initialize it with this, then call the method when needed like normal. The downside is that the mixin pattern isn't used anywhere else in the repo, so it would be a net-new design pattern that might add complexity.

I also tried direct diamond inheritance (llm_build_hybrid_mamba inherits from both llm_build_mamba and llm_build_granite which both inherit from llm_graph_context), but since the builder classes all implement their graph initialization in their constructors, this failed badly due to initializing all the models on the same graph.

At the moment, because of the complexity of these composition solutions, I'm sticking with the "copy-paste stuff" approach, but I'm going to keep poking at it to see if there's a composition approach that will actually feel clean for hybrid models like this that borrow layer builders from other models.
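For reference, the mixin shape from option 2 above would look roughly like this (schematic only; constructor parameters and the layer-builder body are elided):

struct llm_build_mamba_mixin {
    llm_graph_context * self;   // the borrowed graph context

    explicit llm_build_mamba_mixin(llm_graph_context * self) : self(self) {}

    // implemented in terms of self->build_lora_mm(...), self->ctx0, etc.
    ggml_tensor * build_mamba2_layer(/* inp, ubatch, il, ... */);
};

// a hybrid builder inherits the mixin alongside its normal base, initializes
// it with `this`, and then calls build_mamba2_layer() as usual
struct llm_build_hybrid_mamba : llm_graph_context, llm_build_mamba_mixin {
    llm_build_hybrid_mamba(/* params */)
        : llm_graph_context(/* params */), llm_build_mamba_mixin(this) {
        // attention layers built here; SSM layers via build_mamba2_layer(...)
    }
};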

Author
@younesbelkada Jun 19, 2025

Yes, I also faced this!
For H1 it's a bit more complex because we use d_ssm from the config / hparams whereas other models use d_inner. Also, in our case we have this additional mup_vector that we use right after the ssm_proj projection, so I decided to stick with the copy-paste approach.

Owner

That makes sense. I think it's a good argument in favor of copy-paste: Even if a lot of the logic is the same, it gives each model full ownership over exactly how that logic is wired up and avoids the challenges when a "parent model" makes changes that accidentally break a "child model" that depends on it. This is very similar to how transformers does it, though they've swung back a bit with the modular approach. The more I think about it, though, the more I think I'll stick with this for Granite 4 as well.

@younesbelkada
Author

Thank you @gabe-l-hart for your review! I also noticed that the hybrid cache PR finally got merged. Before addressing your comments, I was thinking of merging my branch with the current llama.cpp master first - what do you think?

@younesbelkada
Author

It seems like Mamba2 might be close to landing: ggml-org#9126 (comment). I might also wait for that PR to be merged into master and then re-merge this branch with master.

@gabe-l-hart
Owner

Yep, that makes total sense. I'm going to do the same with Granite 4 and then hopefully, finally, point it at master!

@gabe-l-hart gabe-l-hart force-pushed the mamba2-sync branch 2 times, most recently from b605bb9 to fdc9a8d on June 27, 2025 17:52