DRAFT: PR to check diff between mamba2-sync and Falcon-H1 #1
base: mamba2-sync
Conversation
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybrid. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
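A minimal sketch of the split described above, under assumed simplified signatures (the enum values and struct layout below are placeholders, not the real llama.cpp definitions):

// Sketch only: the point is the delegation -- the arch-level predicate needs
// no initialized llama_model, and the model-level one just forwards to it.
enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_MAMBA /* ... */ };

bool llm_arch_is_recurrent(llm_arch arch) {       // usable before model init
    switch (arch) {
        case LLM_ARCH_MAMBA: return true;
        default:             return false;
    }
}

struct llama_model { llm_arch arch; /* ... */ };

bool llama_model_is_recurrent(const llama_model * model) {
    return llm_arch_is_recurrent(model->arch);    // thin delegation
}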
Branch: GraniteFour
…s in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…l is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: ggml-org#13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
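For orientation, a toy structural sketch of that idea; the type and member names below are placeholders, not the actual llama_kv_cache_hybrid_recurrent implementation:

#include <cstdint>
#include <functional>
#include <memory>

// Two explicit child caches plus a per-layer routing predicate, so each layer
// uses exactly one of them (placeholder types, for illustration only).
struct kv_cache_attn      { /* attention KV cells ... */ };
struct kv_cache_recurrent { /* recurrent state slots ... */ };

struct kv_cache_hybrid_sketch {
    std::unique_ptr<kv_cache_attn>      attn;
    std::unique_ptr<kv_cache_recurrent> recr;
    std::function<bool(int32_t)>        is_recurrent_layer;

    bool layer_uses_recurrent(int32_t il) const { return  is_recurrent_layer(il); }
    bool layer_uses_attention(int32_t il) const { return !is_recurrent_layer(il); }
};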
This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…he interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…empt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
No longer needed now that unified isn't also supporting recurrent ggml-org#13979 (comment) Branch: HybridRecurrentCache
Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is no longer needed now that there are separate implementations ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
ggml-org#13979 (comment) Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…y state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ntion pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per-layer components are created for attention layers and recurrent layers. The main changes are:
- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs * as the first input
- Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent * as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified
- Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent * as the first input
This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* origin/compilade/mamba2: (27 commits)
  ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
  ggml : fix mamba2 ssm scan when compiled with SVE
  graph : fix recurrent state copies when avoiding copies
  kv-cache : allow context shift for recurrent models
  convert : avoid AutoConfig for Mamba and Mamba2 hparams
  kv-cache : remove const_cast when setting inputs for s_copy
  metal : single-user mamba2 inference works
  metal : add missing args for nb references in ssm_scan_f32_group
  metal : fix confusion between ; and ,
  convert : fix flake8 lint
  ggml : avoid multiply by D in GGML_OP_SSM_SCAN
  ggml : remove unused fast broadcast path in GGML_MUL
  metal : fix wrong number of tokens per sequence in SSM_SCAN
  metal : fix SSM_SCAN state head offset
  metal : add back n_seqs to SSM_SCAN args
  metal : remove unused arguments for SSM_SCAN
  metal : use log and exp instead of log1pf and expf in SSM_SCAN
  metal : fix SSM_SCAN pipeline scope
  metal : attempt to adapt SSM_SCAN for Mamba-2
  llama : avoid redundant state copy for Mamba 1 and 2
  ...
https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada
Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…inference running
@@ -32,7 +32,7 @@ llama_kv_cache_hybrid_recurrent::llama_kv_cache_hybrid_recurrent(
     kv_attn(new llama_kv_cache_unified(
         model,
         attn_filter == nullptr ?
-            [&](int32_t il) { return !model.hparams.recurrent_layer(il); }
+            [&](int32_t il) { return model.hparams.recurrent_layer(il); }
The only open question I have on my end is how we can refactor this so that it can also accept models where we have attn and ssm - do you have any idea how we can tackle this @gabe-l-hart? 👀
I think this should already be doable, but it will need to be done in llama_model::create_memory. My intent with the refactor starting here is that there should be default behavior for models based on llm_arch_is_recurrent and llm_arch_is_hybrid_recurrent, but that individual models that need special initialization can add themselves to the switch statement before the default and initialize the correct kind of memory for themselves (although it would be even nicer if there were simply a way to override create_memory for a given model type's builder class).
For H1, I think you would essentially copy the clause used to initialize the hybrid recurrent cache (here) and use the option to add overriding filters to make "always on" filters for both child cache types.
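A rough sketch of that dispatch shape, with every type and helper below standing in for the real llama.cpp API rather than reproducing it:

#include <memory>

// Placeholder memory/cache types for illustration only.
struct memory_i { virtual ~memory_i() = default; };
struct unified_cache          : memory_i {};
struct recurrent_cache        : memory_i {};
struct hybrid_recurrent_cache : memory_i {};

enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_MAMBA, LLM_ARCH_FALCON_H1 };

static bool llm_arch_is_recurrent(llm_arch a)        { return a == LLM_ARCH_MAMBA; }
static bool llm_arch_is_hybrid_recurrent(llm_arch a) { return a == LLM_ARCH_FALCON_H1; }

std::unique_ptr<memory_i> create_memory_sketch(llm_arch arch) {
    switch (arch) {
        // models needing special cache wiring add a case here, before the default
        // (e.g. a hybrid cache with "always on" filters for both children)
        default:
            if (llm_arch_is_hybrid_recurrent(arch)) return std::make_unique<hybrid_recurrent_cache>();
            if (llm_arch_is_recurrent(arch))        return std::make_unique<recurrent_cache>();
            return std::make_unique<unified_cache>(); // or an iswa cache, based on hparams
    }
}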
Thank you for the pointer @gabe-l-hart! Will look into that
    std::fill(
        hparams.recurrent_layer_arr.begin(),
        hparams.recurrent_layer_arr.end(),
-       llm_arch_is_recurrent(ml.get_arch()));
+       true);
Also, this is hard-coded here; we'll need to change it.
My goal with hard-coding this here was to make this a "sane default" for models that don't have specific override logic. For Granite 4, I overwrite this here when parsing hparams.
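For illustration, a hedged sketch of what that per-layer override could look like when parsing hparams; the field and index names here are hypothetical, not the exact Granite 4 code:

#include <cstdint>
#include <vector>

int main() {
    const uint32_t n_layer = 8;

    // "sane default": every layer recurrent
    std::vector<bool> recurrent_layer_arr(n_layer, true);

    // hypothetical per-model override: these layers use attention instead
    const std::vector<uint32_t> attention_layer_indices = {1, 5};
    for (uint32_t il : attention_layer_indices) {
        recurrent_layer_arr[il] = false;
    }
    return 0;
}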
@@ -604,6 +604,11 @@ ggml_tensor * llm_graph_context::build_ffn(
            case LLM_FFN_PAR:
                {
                    cur = build_lora_mm(gate, cur);

                    if (arch == LLM_ARCH_FALCON_H1) {
                        cur = ggml_scale(ctx0, cur, hparams.mlp_gate_multiplier);
Depending on what the core maintainers say about the addition of these new hparams for H1, I think the best approach here would be to simply check if hparams.mlp_gate_multiplier is 0 rather than doing an architecture check. I think in general, the repo is trying to consolidate the number of places that individual architecture enums are checked in order to slowly encapsulate the logic for a given architecture into a self-contained class (this is speculation, but it looks like the direction that things are headed).
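For example, the hunk above could be gated on the hparam itself rather than the arch enum (a sketch, assuming an unset mlp_gate_multiplier is left at 0.0f):

// Hypothetical variant of the snippet above, in the same build_ffn context:
if (hparams.mlp_gate_multiplier != 0.0f) {
    cur = ggml_scale(ctx0, cur, hparams.mlp_gate_multiplier);
}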
Makes sense, will do that
    uint32_t ssm_head_dim = 0;
    uint32_t ssm_mamba_d_ssm = 0;

    // Falcon-H1 specific parameters
This is definitely interesting. My guess is that there will be some pushback on adding this many new hparams to the central struct that are specific to a single architecture. Many of these appear to have analogous params elsewhere in the struct, so the general preference is to reuse existing params (even if their names don't match with the HF names) and then do the name translation in the convert_hf_to_gguf.py / gguf-py portion.
    ggml_build_forward_expand(gf, cur);
}

ggml_tensor * build_mamba2_layer(
I'm in the same boat for Granite 4 where I've copy-pasted this build_mamba2_layer method. I've tried several other ways of making things more composable, but they each have their downsides:
- Making the build_mamba_model::build_mamba*_layer methods static and then passing this when calling them. This is nice because other model builders can call them by simply calling build_mamba_model::build_mamba2_layer(this, ..., but it makes the method signature gross and causes the "child model type" to know about the "parent model type", which is a bit of an inversion of how the logical composition should go.
- Creating a special base class that doesn't inherit from llm_graph_context as a mixin (llm_build_mamba_mixin) where it is initialized with a llm_graph_context * called self and then uses that in the implementations of the layer builders (see the sketch after this comment). This is nicer when using them from builders that need them because you simply have to inherit from the mixin class, initialize it with this, then call the method when needed like normal. The downside is that the mixin pattern isn't used anywhere else in the repo, so it would be a net-new design pattern that might add complexity.
I also tried direct diamond inheritance (llm_build_hybrid_mamba inherits from both llm_build_mamba and llm_build_granite, which both inherit from llm_graph_context), but since the builder classes all implement their graph initialization in their constructors, this failed badly due to initializing all the models on the same graph.
At the moment, because of the complexity of these composition solutions, I'm sticking with the "copy-paste stuff" approach, but I'm going to keep poking at it to see if there's a composition approach that will actually feel clean for hybrid models like this that borrow layer builders from other models.
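A standalone toy of the mixin idea mentioned above; llm_graph_context, llm_build_mamba_mixin, and build_mamba2_layer here are stand-ins with made-up, simplified signatures, not the real llama.cpp classes:

#include <cstdio>

struct llm_graph_context {                 // stand-in for the real graph builder base
    int n_layer = 4;
    void note(const char * what, int il) { std::printf("layer %d: %s\n", il, what); }
};

// Mixin: does not inherit from llm_graph_context; it borrows one via `self`.
struct llm_build_mamba_mixin {
    llm_graph_context * self;
    explicit llm_build_mamba_mixin(llm_graph_context * ctx) : self(ctx) {}
    void build_mamba2_layer(int il) { self->note("mamba2 block", il); }
};

// A hybrid builder inherits the graph context *and* the mixin, wiring the mixin
// back to itself so borrowed layer builders operate on the same graph.
struct llm_build_hybrid : llm_graph_context, llm_build_mamba_mixin {
    llm_build_hybrid() : llm_build_mamba_mixin(this) {}
    void build() {
        for (int il = 0; il < n_layer; ++il) {
            if (il % 2 == 0) build_mamba2_layer(il);      // recurrent layers
            else             note("attention block", il); // attention layers
        }
    }
};

int main() { llm_build_hybrid().build(); }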
Yes, I also faced this! For H1 it's a bit more complex because we use d_ssm from the config / hparams whereas other models use d_inner. Also, in our case we have this additional mup_vector that we use right after the projection of ssm_proj, so I decided to stick with the copy-paste approach.
That makes sense. I think it's a good argument in favor of copy-paste: even if a lot of the logic is the same, it gives each model full ownership over exactly how that logic is wired up and avoids the challenges when a "parent model" makes changes that accidentally break a "child model" that depends on it. This is very similar to how transformers does it, though they've swung back a bit with the modular approach. The more I think about it, though, the more I think I'll stick with this for Granite 4 as well.
Thank you @gabe-l-hart for your review! I also noticed that the hybrid cache PR finally got merged. Before addressing your comments, I was thinking of merging my branch with the current llama.cpp master first, what do you think?
It seems like Mamba2 might be close to landing: ggml-org#9126 (comment). I might also wait for that PR to be merged on master and re-merge this branch into master.
Yep, that makes total sense. I'm going to do the same with Granite 4 and then hopefully finally point it at master.
b605bb9 to fdc9a8d
No description provided.