
bug: Cannot start the GGUF model #1719

Open · 2 of 7 tasks
grzegorz-bielski opened this issue Nov 24, 2024 · 1 comment
Labels: category: tools (RAG, function calling, etc) · type: bug (Something isn't working)


grzegorz-bielski commented Nov 24, 2024

Cortex version

1.0.3

Describe the issue and expected behaviour

I tried to run https://huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF behind the OpenAI-compatible embeddings endpoint, but the model couldn't be started; see the reproduction steps and logs below.
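
For context, this is the kind of request I intended to send once the model was running (a sketch only: the path and payload assume the server's OpenAI-compatible API on the port shown in the logs below):

# Hypothetical example request, not taken from the original report:
# call the OpenAI-compatible embeddings endpoint on the local server
curl http://127.0.0.1:39281/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf",
    "input": "hello world"
  }'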

The same thing happens with https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1.

On a side note, I also tried to run nomic-embed-text-v1 from the built-in models list, but both `cortex pull cortexso/nomic-embed-text-v1` and `cortex run nomic-embed-text-v1` fail with `No variant available`. That seems like a separate issue, though.

Steps to Reproduce

# The model gets downloaded just fine
cortex pull yixuan-chia/snowflake-arctic-embed-m-GGUF
# I can access its info without any problems
cortex models get yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf
# But it crashes here with `HTTP error: Failed to read connection`
cortex models start yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf

Screenshots / Logs

The logs for the last, crashing command, from ~/cortexcpp/logs/cortex.log (not ~/cortex/logs/ as the ticket template says; the template may need updating):

20241124 11:21:38.970271 UTC 69532246 INFO  Host: 127.0.0.1 Port: 39281
 - main.cc:80
20241124 11:21:38.971643 UTC 69532246 INFO  cortex.cpp version: v1.0.3 - main.cc:89
20241124 11:21:38.981916 UTC 69532246 INFO  nvidia-smi is not available! - system_info_utils.h:130
20241124 11:21:38.983383 UTC 69532246 INFO  Activated GPUs before:  - hardware_service.cc:244
20241124 11:21:38.983435 UTC 69532246 INFO  Activated GPUs after:  - hardware_service.cc:268
20241124 11:21:38.986450 UTC 69532246 INFO  Starting worker thread: 0 - download_service.cc:302
20241124 11:21:38.986504 UTC 69532246 INFO  Starting worker thread: 1 - download_service.cc:302
20241124 11:21:38.986532 UTC 69532246 INFO  Starting worker thread: 2 - download_service.cc:302
20241124 11:21:38.986555 UTC 69532246 INFO  Starting worker thread: 3 - download_service.cc:302
20241124 11:21:38.991551 UTC 69532246 INFO  nvidia-smi is not available! - system_info_utils.h:130
20241124 11:21:38.994464 UTC 69532246 INFO  Server started, listening at: 127.0.0.1:39281 - main.cc:140
20241124 11:21:38.994479 UTC 69532246 INFO  Please load your model - main.cc:142
20241124 11:21:38.994494 UTC 69532246 INFO  Number of thread is:10 - main.cc:149
20241124 11:21:39.946448 UTC 69532269 INFO  Origin:  - main.cc:162
20241124 11:21:39.959655 UTC 69532270 INFO  {
	"ai_prompt" : "[/INST]",
	"ai_template" : "[/INST]",
	"created" : 0,
	"ctx_len" : 512,
	"dynatemp_exponent" : 1.0,
	"dynatemp_range" : 0.0,
	"engine" : "llama-cpp",
	"files" :
	[
		"models/huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF/snowflake-arctic-embed-m-F16.gguf"
	],
	"frequency_penalty" : 0.0,
	"gpu_arch" : "",
	"ignore_eos" : false,
	"max_tokens" : 512,
	"min_keep" : 0,
	"min_p" : 0.05000000074505806,
	"mirostat" : false,
	"mirostat_eta" : 0.10000000149011612,
	"mirostat_tau" : 5.0,
	"model" : "yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf",
	"model_path" : "/Users/grzegorzbielski/cortexcpp/models/huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF/snowflake-arctic-embed-m-F16.gguf",
	"n_parallel" : 1,
	"n_probs" : 0,
	"name" : "Snowflake-Arctic-Embed-M",
	"ngl" : 13,
	"object" : "",
	"os" : "",
	"owned_by" : "",
	"penalize_nl" : false,
	"precision" : "",
	"presence_penalty" : 0.0,
	"prompt_template" : "[INST] <<SYS>>\n{system_message}\n<</SYS>>\n{prompt}[/INST]",
	"quantization_method" : "",
	"repeat_last_n" : 64,
	"repeat_penalty" : 1.0,
	"seed" : -1,
	"size" : 0,
	"stop" :
	[
		"[PAD]"
	],
	"stream" : true,
	"system_prompt" : "[INST] <<SYS>>\n",
	"system_template" : "[INST] <<SYS>>\n",
	"temperature" : 0.69999998807907104,
	"text_model" : false,
	"tfs_z" : 1.0,
	"top_k" : 40,
	"top_p" : 0.94999998807907104,
	"typ_p" : 1.0,
	"user_prompt" : "\n<</SYS>>\n",
	"user_template" : "\n<</SYS>>\n",
	"version" : "2"
}
 - model_service.cc:667
20241124 11:21:39.974798 UTC 69532270 INFO  nvidia-smi is not available! - system_info_utils.h:130
20241124 11:21:39.976185 UTC 69532270 INFO  is_cuda: 0 - model_service.cc:680
20241124 11:21:39.976265 UTC 69532270 INFO  Loading engine: cortex.llamacpp - engine_service.cc:771
20241124 11:21:39.977503 UTC 69532270 INFO  Selected engine variant: {"engine":"cortex.llamacpp","variant":"mac-arm64","version":"v0.1.39"} - engine_service.cc:780
20241124 11:21:39.978785 UTC 69532270 INFO  Engine path: /Users/grzegorzbielski/cortexcpp/engines/cortex.llamacpp/mac-arm64/v0.1.39 - engine_service.cc:805
20241124 11:21:39.983962 UTC 69532270 INFO  cortex.llamacpp version: 0.1.39 - llama_engine.cc:308
20241124 11:21:39.984439 UTC 69532270 INFO  Number of parallel is set to 1 - llama_engine.cc:544
20241124 11:21:39.984490 UTC 69532270 INFO  system info: {'n_thread': 10, 'total_threads': 10. 'system_info': 'AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | '} - llama_engine.cc:620
20241124 11:21:40.052148 UTC 69532270 INFO  llama_load_model_from_file: using device Metal (Apple M1 Max) - 21845 MiB free
 - llama_engine.cc:475
20241124 11:21:40.055512 UTC 69532270 INFO  llama_model_loader: loaded meta data with 26 key-value pairs and 197 tensors from /Users/grzegorzbielski/cortexcpp/models/huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF/snowflake-arctic-embed-m-F16.gguf (version GGUF V3 (latest))
 - llama_engine.cc:475
20241124 11:21:40.055544 UTC 69532270 INFO  llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
 - llama_engine.cc:475
20241124 11:21:40.055562 UTC 69532270 INFO  llama_model_loader: - kv   0:                       general.architecture str              = bert
 - llama_engine.cc:475
20241124 11:21:40.055577 UTC 69532270 INFO  llama_model_loader: - kv   1:                               general.type str              = model
 - llama_engine.cc:475
20241124 11:21:40.055592 UTC 69532270 INFO  llama_model_loader: - kv   2:                               general.name str              = Snowflake Arctic Embed M
 - llama_engine.cc:475
20241124 11:21:40.055607 UTC 69532270 INFO  llama_model_loader: - kv   3:                         general.size_label str              = 109M
 - llama_engine.cc:475
20241124 11:21:40.055621 UTC 69532270 INFO  llama_model_loader: - kv   4:                            general.license str              = apache-2.0
 - llama_engine.cc:475
20241124 11:21:40.055640 UTC 69532270 INFO  llama_model_loader: - kv   5:                               general.tags arr[str,8]       = ["sentence-transformers", "feature-ex...
 - llama_engine.cc:475
20241124 11:21:40.055655 UTC 69532270 INFO  llama_model_loader: - kv   6:                           bert.block_count u32              = 12
 - llama_engine.cc:475
20241124 11:21:40.055669 UTC 69532270 INFO  llama_model_loader: - kv   7:                        bert.context_length u32              = 512
 - llama_engine.cc:475
20241124 11:21:40.055683 UTC 69532270 INFO  llama_model_loader: - kv   8:                      bert.embedding_length u32              = 768
 - llama_engine.cc:475
20241124 11:21:40.055697 UTC 69532270 INFO  llama_model_loader: - kv   9:                   bert.feed_forward_length u32              = 3072
 - llama_engine.cc:475
20241124 11:21:40.055712 UTC 69532270 INFO  llama_model_loader: - kv  10:                  bert.attention.head_count u32              = 12
 - llama_engine.cc:475
20241124 11:21:40.055728 UTC 69532270 INFO  llama_model_loader: - kv  11:          bert.attention.layer_norm_epsilon f32              = 0.000000
 - llama_engine.cc:475
20241124 11:21:40.055746 UTC 69532270 INFO  llama_model_loader: - kv  12:                          general.file_type u32              = 1
 - llama_engine.cc:475
20241124 11:21:40.055761 UTC 69532270 INFO  llama_model_loader: - kv  13:                      bert.attention.causal bool             = false
 - llama_engine.cc:475
20241124 11:21:40.055776 UTC 69532270 INFO  llama_model_loader: - kv  14:                          bert.pooling_type u32              = 2
 - llama_engine.cc:475
20241124 11:21:40.055792 UTC 69532270 INFO  llama_model_loader: - kv  15:            tokenizer.ggml.token_type_count u32              = 2
 - llama_engine.cc:475
20241124 11:21:40.055808 UTC 69532270 INFO  llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = bert
 - llama_engine.cc:475
20241124 11:21:40.055823 UTC 69532270 INFO  llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = jina-v2-en
 - llama_engine.cc:475
20241124 11:21:40.060101 UTC 69532270 INFO  llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
 - llama_engine.cc:475
20241124 11:21:40.061260 UTC 69532270 INFO  llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
 - llama_engine.cc:475
20241124 11:21:40.061277 UTC 69532270 INFO  llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 100
 - llama_engine.cc:475
20241124 11:21:40.061291 UTC 69532270 INFO  llama_model_loader: - kv  21:          tokenizer.ggml.seperator_token_id u32              = 102
 - llama_engine.cc:475
20241124 11:21:40.061304 UTC 69532270 INFO  llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 0
 - llama_engine.cc:475
20241124 11:21:40.061318 UTC 69532270 INFO  llama_model_loader: - kv  23:                tokenizer.ggml.cls_token_id u32              = 101
 - llama_engine.cc:475
20241124 11:21:40.061332 UTC 69532270 INFO  llama_model_loader: - kv  24:               tokenizer.ggml.mask_token_id u32              = 103
 - llama_engine.cc:475
20241124 11:21:40.061346 UTC 69532270 INFO  llama_model_loader: - kv  25:               general.quantization_version u32              = 2
 - llama_engine.cc:475
20241124 11:21:40.061360 UTC 69532270 INFO  llama_model_loader: - type  f32:  124 tensors
 - llama_engine.cc:475
20241124 11:21:40.061373 UTC 69532270 INFO  llama_model_loader: - type  f16:   73 tensors
 - llama_engine.cc:475
20241124 11:21:40.065239 UTC 69532270 INFO  llm_load_vocab: special tokens cache size = 5
 - llama_engine.cc:475
20241124 11:21:40.067499 UTC 69532270 INFO  llm_load_vocab: token to piece cache size = 0.2032 MB
 - llama_engine.cc:475
20241124 11:21:40.067551 UTC 69532270 INFO  llm_load_print_meta: format           = GGUF V3 (latest)
 - llama_engine.cc:475
20241124 11:21:40.067567 UTC 69532270 INFO  llm_load_print_meta: arch             = bert
 - llama_engine.cc:475
20241124 11:21:40.067585 UTC 69532270 INFO  llm_load_print_meta: vocab type       = WPM
 - llama_engine.cc:475
20241124 11:21:40.067598 UTC 69532270 INFO  llm_load_print_meta: n_vocab          = 30522
 - llama_engine.cc:475
20241124 11:21:40.067612 UTC 69532270 INFO  llm_load_print_meta: n_merges         = 0
 - llama_engine.cc:475
20241124 11:21:40.067625 UTC 69532270 INFO  llm_load_print_meta: vocab_only       = 0
 - llama_engine.cc:475
20241124 11:21:40.067638 UTC 69532270 INFO  llm_load_print_meta: n_ctx_train      = 512
 - llama_engine.cc:475
20241124 11:21:40.067653 UTC 69532270 INFO  llm_load_print_meta: n_embd           = 768
 - llama_engine.cc:475
20241124 11:21:40.067667 UTC 69532270 INFO  llm_load_print_meta: n_layer          = 12
 - llama_engine.cc:475
20241124 11:21:40.067685 UTC 69532270 INFO  llm_load_print_meta: n_head           = 12
 - llama_engine.cc:475
20241124 11:21:40.067700 UTC 69532270 INFO  llm_load_print_meta: n_head_kv        = 12
 - llama_engine.cc:475
20241124 11:21:40.067714 UTC 69532270 INFO  llm_load_print_meta: n_rot            = 64
 - llama_engine.cc:475
20241124 11:21:40.067728 UTC 69532270 INFO  llm_load_print_meta: n_swa            = 0
 - llama_engine.cc:475
20241124 11:21:40.067741 UTC 69532270 INFO  llm_load_print_meta: n_embd_head_k    = 64
 - llama_engine.cc:475
20241124 11:21:40.067755 UTC 69532270 INFO  llm_load_print_meta: n_embd_head_v    = 64
 - llama_engine.cc:475
20241124 11:21:40.067769 UTC 69532270 INFO  llm_load_print_meta: n_gqa            = 1
 - llama_engine.cc:475
20241124 11:21:40.067784 UTC 69532270 INFO  llm_load_print_meta: n_embd_k_gqa     = 768
 - llama_engine.cc:475
20241124 11:21:40.067797 UTC 69532270 INFO  llm_load_print_meta: n_embd_v_gqa     = 768
 - llama_engine.cc:475
20241124 11:21:40.067810 UTC 69532270 INFO  llm_load_print_meta: f_norm_eps       = 1.0e-12
 - llama_engine.cc:475
20241124 11:21:40.067822 UTC 69532270 INFO  llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
 - llama_engine.cc:475
20241124 11:21:40.067834 UTC 69532270 INFO  llm_load_print_meta: f_clamp_kqv      = 0.0e+00
 - llama_engine.cc:475
20241124 11:21:40.067846 UTC 69532270 INFO  llm_load_print_meta: f_max_alibi_bias = 0.0e+00
 - llama_engine.cc:475
20241124 11:21:40.067863 UTC 69532270 INFO  llm_load_print_meta: f_logit_scale    = 0.0e+00
 - llama_engine.cc:475
20241124 11:21:40.067876 UTC 69532270 INFO  llm_load_print_meta: n_ff             = 3072
 - llama_engine.cc:475
20241124 11:21:40.067889 UTC 69532270 INFO  llm_load_print_meta: n_expert         = 0
 - llama_engine.cc:475
20241124 11:21:40.067903 UTC 69532270 INFO  llm_load_print_meta: n_expert_used    = 0
 - llama_engine.cc:475
20241124 11:21:40.067917 UTC 69532270 INFO  llm_load_print_meta: causal attn      = 0
 - llama_engine.cc:475
20241124 11:21:40.067931 UTC 69532270 INFO  llm_load_print_meta: pooling type     = 2
 - llama_engine.cc:475
20241124 11:21:40.067945 UTC 69532270 INFO  llm_load_print_meta: rope type        = 2
 - llama_engine.cc:475
20241124 11:21:40.067958 UTC 69532270 INFO  llm_load_print_meta: rope scaling     = linear
 - llama_engine.cc:475
20241124 11:21:40.067972 UTC 69532270 INFO  llm_load_print_meta: freq_base_train  = 10000.0
 - llama_engine.cc:475
20241124 11:21:40.067986 UTC 69532270 INFO  llm_load_print_meta: freq_scale_train = 1
 - llama_engine.cc:475
20241124 11:21:40.068000 UTC 69532270 INFO  llm_load_print_meta: n_ctx_orig_yarn  = 512
 - llama_engine.cc:475
20241124 11:21:40.068014 UTC 69532270 INFO  llm_load_print_meta: rope_finetuned   = unknown
 - llama_engine.cc:475
20241124 11:21:40.068027 UTC 69532270 INFO  llm_load_print_meta: ssm_d_conv       = 0
 - llama_engine.cc:475
20241124 11:21:40.068039 UTC 69532270 INFO  llm_load_print_meta: ssm_d_inner      = 0
 - llama_engine.cc:475
20241124 11:21:40.068051 UTC 69532270 INFO  llm_load_print_meta: ssm_d_state      = 0
 - llama_engine.cc:475
20241124 11:21:40.068063 UTC 69532270 INFO  llm_load_print_meta: ssm_dt_rank      = 0
 - llama_engine.cc:475
20241124 11:21:40.068075 UTC 69532270 INFO  llm_load_print_meta: ssm_dt_b_c_rms   = 0
 - llama_engine.cc:475
20241124 11:21:40.068090 UTC 69532270 INFO  llm_load_print_meta: model type       = 109M
 - llama_engine.cc:475
20241124 11:21:40.068108 UTC 69532270 INFO  llm_load_print_meta: model ftype      = F16
 - llama_engine.cc:475
20241124 11:21:40.068120 UTC 69532270 INFO  llm_load_print_meta: model params     = 108.89 M
 - llama_engine.cc:475
20241124 11:21:40.068136 UTC 69532270 INFO  llm_load_print_meta: model size       = 208.68 MiB (16.08 BPW)
 - llama_engine.cc:475
20241124 11:21:40.068149 UTC 69532270 INFO  llm_load_print_meta: general.name     = Snowflake Arctic Embed M
 - llama_engine.cc:475
20241124 11:21:40.068162 UTC 69532270 INFO  llm_load_print_meta: UNK token        = 100 '[UNK]'
 - llama_engine.cc:475
20241124 11:21:40.068175 UTC 69532270 INFO  llm_load_print_meta: SEP token        = 102 '[SEP]'
 - llama_engine.cc:475
20241124 11:21:40.068188 UTC 69532270 INFO  llm_load_print_meta: PAD token        = 0 '[PAD]'
 - llama_engine.cc:475
20241124 11:21:40.068202 UTC 69532270 INFO  llm_load_print_meta: CLS token        = 101 '[CLS]'
 - llama_engine.cc:475
20241124 11:21:40.068215 UTC 69532270 INFO  llm_load_print_meta: MASK token       = 103 '[MASK]'
 - llama_engine.cc:475
20241124 11:21:40.068228 UTC 69532270 INFO  llm_load_print_meta: LF token         = 0 '[PAD]'
 - llama_engine.cc:475
20241124 11:21:40.068242 UTC 69532270 INFO  llm_load_print_meta: max token length = 21
 - llama_engine.cc:475
20241124 11:21:40.070025 UTC 69532270 INFO  llm_load_tensors: offloading 12 repeating layers to GPU
 - llama_engine.cc:475
20241124 11:21:40.070056 UTC 69532270 INFO  llm_load_tensors: offloading output layer to GPU
 - llama_engine.cc:475
20241124 11:21:40.070071 UTC 69532270 INFO  llm_load_tensors: offloaded 13/13 layers to GPU
 - llama_engine.cc:475
20241124 11:21:40.070087 UTC 69532270 INFO  llm_load_tensors: Metal_Mapped model buffer size =   162.46 MiB
 - llama_engine.cc:475
20241124 11:21:40.070101 UTC 69532270 INFO  llm_load_tensors:   CPU_Mapped model buffer size =    46.22 MiB
 - llama_engine.cc:475
[... repeated "INFO  . - llama_engine.cc:475" progress lines omitted ...]
20241124 11:21:40.070653 UTC 69532270 INFO
 - llama_engine.cc:475
20241124 11:21:40.071012 UTC 69532270 INFO  llama_new_context_with_model: n_seq_max     = 1
 - llama_engine.cc:475
20241124 11:21:40.071023 UTC 69532270 INFO  llama_new_context_with_model: n_ctx         = 512
 - llama_engine.cc:475
20241124 11:21:40.071036 UTC 69532270 INFO  llama_new_context_with_model: n_ctx_per_seq = 512
 - llama_engine.cc:475
20241124 11:21:40.071046 UTC 69532270 INFO  llama_new_context_with_model: n_batch       = 2048
 - llama_engine.cc:475
20241124 11:21:40.071056 UTC 69532270 INFO  llama_new_context_with_model: n_ubatch      = 2048
 - llama_engine.cc:475
20241124 11:21:40.071066 UTC 69532270 INFO  llama_new_context_with_model: flash_attn    = 1
 - llama_engine.cc:475
20241124 11:21:40.071076 UTC 69532270 INFO  llama_new_context_with_model: freq_base     = 10000.0
 - llama_engine.cc:475
20241124 11:21:40.071087 UTC 69532270 INFO  llama_new_context_with_model: freq_scale    = 1
 - llama_engine.cc:475
20241124 11:21:40.071097 UTC 69532270 INFO  ggml_metal_init: allocating
 - llama_engine.cc:475
20241124 11:21:40.071112 UTC 69532270 INFO  ggml_metal_init: found device: Apple M1 Max
 - llama_engine.cc:475
20241124 11:21:40.071128 UTC 69532270 INFO  ggml_metal_init: picking default device: Apple M1 Max
 - llama_engine.cc:475
20241124 11:21:40.071997 UTC 69532270 INFO  ggml_metal_init: using embedded metal library
 - llama_engine.cc:475
20241124 11:21:40.076377 UTC 69532270 INFO  ggml_metal_init: GPU name:   Apple M1 Max
 - llama_engine.cc:475
20241124 11:21:40.076396 UTC 69532270 INFO  ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
 - llama_engine.cc:475
20241124 11:21:40.076409 UTC 69532270 INFO  ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
 - llama_engine.cc:475
20241124 11:21:40.076421 UTC 69532270 INFO  ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
 - llama_engine.cc:475
20241124 11:21:40.076432 UTC 69532270 INFO  ggml_metal_init: simdgroup reduction   = true
 - llama_engine.cc:475
20241124 11:21:40.076443 UTC 69532270 INFO  ggml_metal_init: simdgroup matrix mul. = true
 - llama_engine.cc:475
20241124 11:21:40.076455 UTC 69532270 INFO  ggml_metal_init: has bfloat            = true
 - llama_engine.cc:475
20241124 11:21:40.076466 UTC 69532270 INFO  ggml_metal_init: use bfloat            = false
 - llama_engine.cc:475
20241124 11:21:40.076477 UTC 69532270 INFO  ggml_metal_init: hasUnifiedMemory      = true
 - llama_engine.cc:475
20241124 11:21:40.076489 UTC 69532270 INFO  ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
 - llama_engine.cc:475
20241124 11:21:40.078759 UTC 69532270 WARN  ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
 - llama_engine.cc:473
20241124 11:21:40.079510 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
 - llama_engine.cc:473
20241124 11:21:40.079530 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
 - llama_engine.cc:473
20241124 11:21:40.079540 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
 - llama_engine.cc:473
20241124 11:21:40.079550 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
 - llama_engine.cc:473
20241124 11:21:40.080422 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
 - llama_engine.cc:473
20241124 11:21:40.081204 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
 - llama_engine.cc:473
20241124 11:21:40.081922 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083301 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083315 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083324 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083334 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083344 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083354 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
 - llama_engine.cc:473
20241124 11:21:40.084852 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
 - llama_engine.cc:473
20241124 11:21:40.085091 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
 - llama_engine.cc:473
20241124 11:21:40.085347 UTC 69532270 WARN  ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
 - llama_engine.cc:473
20241124 11:21:40.085418 UTC 69532270 WARN  ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
 - llama_engine.cc:473
20241124 11:21:40.085429 UTC 69532270 WARN  ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
 - llama_engine.cc:473
20241124 11:21:40.087354 UTC 69532270 INFO  llama_kv_cache_init:      Metal KV buffer size =    18.00 MiB
 - llama_engine.cc:475
20241124 11:21:40.087379 UTC 69532270 INFO  llama_new_context_with_model: KV self size  =   18.00 MiB, K (f16):    9.00 MiB, V (f16):    9.00 MiB
 - llama_engine.cc:475
20241124 11:21:40.087400 UTC 69532270 INFO  llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
 - llama_engine.cc:475
20241124 11:21:40.088241 UTC 69532270 INFO  llama_new_context_with_model:      Metal compute buffer size =    19.50 MiB
 - llama_engine.cc:475
20241124 11:21:40.088261 UTC 69532270 INFO  llama_new_context_with_model:        CPU compute buffer size =     4.00 MiB
 - llama_engine.cc:475
20241124 11:21:40.088270 UTC 69532270 INFO  llama_new_context_with_model: graph nodes  = 429
 - llama_engine.cc:475
20241124 11:21:40.088279 UTC 69532270 INFO  llama_new_context_with_model: graph splits = 2
 - llama_engine.cc:475
/Users/runner/work/cortex.llamacpp/cortex.llamacpp/llama.cpp/src/llama.cpp:17453: GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor") failed

What is your OS?

- [ ] Windows
- [x] Mac Silicon
- [ ] Mac Intel
- [ ] Linux / Ubuntu

What engine are you running?

- [x] cortex.llamacpp (default)
- [ ] cortex.tensorrt-llm (Nvidia GPUs)
- [ ] cortex.onnx (NPUs, DirectML)

Hardware Specs eg OS version, GPU

Apple M1 Max, Sonoma 14.7

@grzegorz-bielski grzegorz-bielski added the type: bug Something isn't working label Nov 24, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Jan & Cortex Nov 24, 2024
louis-jan (Contributor) commented:

Ah, embedding models require additional model-load parameters. This is supported via the API but not the CLI yet. To be updated.
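
For reference, that matches the assert at the end of the log: a BERT-architecture embedding model ends its graph in a pooled-embedding tensor rather than the `result_output` logits tensor llama.cpp expects when the context is not created in embedding mode. A rough sketch of the API route (the endpoint shape and the `embedding` flag below are assumptions based on the comment above, not confirmed documentation):

# Hypothetical sketch: load the model through the server API with an
# embedding flag instead of `cortex models start`. The `embedding`
# field is an assumption, not a documented parameter.
curl http://127.0.0.1:39281/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf",
    "embedding": true
  }'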

@louis-jan louis-jan added the category: tools RAG, function calling, etc label Dec 4, 2024