
ggml_metal_init: ggml-common.h not found #211

Closed
Raphy42 opened this issue Mar 20, 2024 · 8 comments
Labels
🐛 bug (something is broken) · 🏗 build · 🍎 mac (macos only updates)

Comments

Raphy42 commented Mar 20, 2024

I updated to the latest version of the library, as I needed the command-r architecture support, but the current crates.io release and main both crash on macOS because metal_hack breaks with the latest version of llama.cpp.
The culprit is ggml-common.h, which is not available to the bundled shader. I have tried replacing the .h include with its actual contents before putting it inside the .m loader, but it's not that simple and is not going to be maintainable at all.

ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:3:10: fatal error: 'ggml-common.h' file not found

I saw in the llama.cpp issues that this could be fixed by having default.metallib built by the CMake project, but this would imply modifying the current build.rs heavily, and I have no CUDA-compatible machine.
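For what it's worth, I imagine the build.rs would end up doing something like this with the cmake crate (a rough sketch only; the LLAMA_METAL / LLAMA_METAL_EMBED_LIBRARY option names are taken from llama.cpp's CMakeLists of that revision, and the submodule path is a guess):

// build.rs sketch (assumption, not the crate's actual script): have CMake build
// llama.cpp with the Metal shader library embedded, so ggml-common.h is resolved
// at build time instead of at runtime. Requires `cmake` in [build-dependencies].
fn main() {
    let dst = cmake::Config::new("llama.cpp") // path to the llama.cpp submodule (guess)
        .define("LLAMA_METAL", "ON")
        .define("LLAMA_METAL_EMBED_LIBRARY", "ON")
        .define("BUILD_SHARED_LIBS", "OFF")
        .build();

    println!("cargo:rustc-link-search=native={}", dst.join("lib").display());
    println!("cargo:rustc-link-lib=static=llama");
}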

MarcusDunn commented Mar 20, 2024

Did commit 756646c fix the issue?

Does it fail to compile or does it crash?

Forgive me, Mac and Metal are completely foreign to me.

Raphy42 commented Mar 20, 2024

Sadly it doesn't. I'm currently messing around with my own public fork, which uses the cmake crate to build everything, and that works!
I don't remember exactly what I did anymore, but I successfully changed the build.rs to build ggml-metal-embed.metal, which is then copied into the package workspace, and then added linking arguments until it worked.
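From memory, the extra linking arguments were something along these lines (a sketch only; the exact framework list is a guess):

// build.rs fragment (sketch): frameworks the Metal backend links against on macOS.
// I kept adding frameworks until the link step succeeded, so treat this list as approximate.
fn main() {
    if cfg!(target_os = "macos") {
        for framework in ["Foundation", "Metal", "MetalKit", "Accelerate"] {
            println!("cargo:rustc-link-lib=framework={framework}");
        }
    }
}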

llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from /Users/<>/.cache/huggingface/hub/models--andrewcanis--c4ai-command-r-v01-GGUF/snapshots/7629a21caf04be51c9010f3ece50e4f8178e0ef1/c4ai-command-r-v01-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.name str              = 9fe64d67d13873f218cb05083b6fc2faab2d034a
llama_model_loader: - kv   2:                      command-r.block_count u32              = 40
llama_model_loader: - kv   3:                   command-r.context_length u32              = 131072
llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 8192
llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 22528
llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 64
llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 64
llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 8000000.000000
llama_model_loader: - kv   9:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                      command-r.logit_scale f32              = 0.062500
llama_model_loader: - kv  12:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   41 tensors
llama_model_loader: - type q4_K:  240 tensors
llama_model_loader: - type q6_K:   41 tensors
.......
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =  1640.62 MiB
llm_load_tensors:      Metal buffer size = 20519.41 MiB
........................................................................................
2024-03-20T20:57:00.994794Z DEBUG load_from_file: llama_cpp_2::model: Loaded model path="/Users/<>/.cache/huggingface/hub/models--andrewcanis--c4ai-command-r-v01-GGUF/snapshots/7629a21caf04be51c9010f3ece50e4f8178e0ef1/c4ai-command-r-v01-Q4_K_M.gguf"
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 8000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      Metal KV buffer size = 163840.00 MiB
llama_new_context_with_model: KV self size  = 163840.00 MiB, K (f16): 81920.00 MiB, V (f16): 81920.00 MiB
llama_new_context_with_model:        CPU  output buffer size =  2000.00 MiB
ggml_gallocr_reserve_n: failed to allocate Metal buffer of size 17515415552
llama_new_context_with_model: failed to allocate compute buffers

But my changes break CUDA support and platforms other than macOS. I don't have the bandwidth to stabilise my fork right now, but I will be sure to open a PR once I have reproducible builds on both platforms; I don't know when, though. If anyone wants to give my fork a try, feel free to do so!

MarcusDunn commented Mar 20, 2024

Sadly it doesn't. I'm currently messing around with my own public fork, which uses the cmake crate to build everything, and that works!

I've been meaning to give this a try - I'd be happy to have a PR to this effect once it's ready. If you open a draft that I can edit, I can try to get it working on Linux + CUDA.

MarcusDunn added the 🍎 mac, 🐛 bug, and 🏗 build labels Mar 20, 2024
Raphy42 commented Mar 20, 2024

Sure! The current edits are a bit hacky, but this is my current implementation: Raphy42@44b6da4
I need to dig deeper into the llama.cpp build system in order to make the build.rs nicer and more parametric.
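Roughly, I'd want the CMake options driven by Cargo features, something like this (a sketch; the metal/cuda feature names are hypothetical, and LLAMA_CUBLAS is the CUDA option name as I understand it for this llama.cpp revision):

// build.rs sketch: map Cargo features onto llama.cpp CMake options so the same
// script can serve macOS/Metal and Linux/CUDA builds.
fn main() {
    let mut cfg = cmake::Config::new("llama.cpp"); // submodule path is a guess
    // Cargo exposes enabled features to build scripts as CARGO_FEATURE_* env vars.
    if std::env::var_os("CARGO_FEATURE_METAL").is_some() {
        cfg.define("LLAMA_METAL", "ON")
            .define("LLAMA_METAL_EMBED_LIBRARY", "ON");
    }
    if std::env::var_os("CARGO_FEATURE_CUDA").is_some() {
        cfg.define("LLAMA_CUBLAS", "ON");
    }
    let dst = cfg.build();
    println!("cargo:rustc-link-search=native={}", dst.join("lib").display());
    println!("cargo:rustc-link-lib=static=llama");
}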

MarcusDunn commented Mar 26, 2024

@Raphy42 does #221 fix it? (no llama_hack; unsure if it's needed when using cmake)

@MarcusDunn
May be fixed on latest. I'm unable to test myself, but #224 apparently fixes this; please let me know.

Raphy42 commented Mar 31, 2024

@Raphy42 does #221 fix it? (no llama_hack; unsure if it's needed when using cmake)

Yeah, "0.1.45" works out of the box for me, tested on gemma-7b and multiple llama-2 variants with the same inference speed from before !
Amazing job !

@MarcusDunn
I just merged a PR. All credit to derrickpersson!

Glad it works!
