
bug: Cannot start the GGUF model #1719

Open · 2 of 7 tasks
grzegorz-bielski opened this issue Nov 24, 2024 · 1 comment
Labels: category: tools (RAG, function calling, etc) · type: bug (Something isn't working)


grzegorz-bielski commented Nov 24, 2024

Cortex version

1.0.3

Describe the issue and expected behaviour

I tried to run https://huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF behind the OpenAI-compatible embeddings endpoint, but the model couldn't be started; see the reproduction steps and logs below.
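
For context, this is the kind of request I intended to send once the model was running (a sketch only: the path and payload assume the server's OpenAI-compatible API on the port shown in the logs below):

# Hypothetical example request, not taken from the original report:
# call the OpenAI-compatible embeddings endpoint on the local server
curl http://127.0.0.1:39281/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf",
    "input": "hello world"
  }'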

The same thing happens with https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1.

On a side note, I also tried to run nomic-embed-text-v1 from the built-in models list, but both `cortex pull cortexso/nomic-embed-text-v1` and `cortex run nomic-embed-text-v1` fail with `No variant available`. That seems like a separate issue, though.

Steps to Reproduce

# The model gets downloaded just fine
cortex pull yixuan-chia/snowflake-arctic-embed-m-GGUF
# I can access its info without any problems
cortex models get yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf
# But it crashes here with `HTTP error: Failed to read connection`
cortex models start yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf

Screenshots / Logs

The logs for the last, crashing command, from ~/cortexcpp/logs/cortex.log (not ~/cortex/logs/ as the ticket template says; the template may need updating):

20241124 11:21:38.970271 UTC 69532246 INFO  Host: 127.0.0.1 Port: 39281
 - main.cc:80
20241124 11:21:38.971643 UTC 69532246 INFO  cortex.cpp version: v1.0.3 - main.cc:89
20241124 11:21:38.981916 UTC 69532246 INFO  nvidia-smi is not available! - system_info_utils.h:130
20241124 11:21:38.983383 UTC 69532246 INFO  Activated GPUs before:  - hardware_service.cc:244
20241124 11:21:38.983435 UTC 69532246 INFO  Activated GPUs after:  - hardware_service.cc:268
20241124 11:21:38.986450 UTC 69532246 INFO  Starting worker thread: 0 - download_service.cc:302
20241124 11:21:38.986504 UTC 69532246 INFO  Starting worker thread: 1 - download_service.cc:302
20241124 11:21:38.986532 UTC 69532246 INFO  Starting worker thread: 2 - download_service.cc:302
20241124 11:21:38.986555 UTC 69532246 INFO  Starting worker thread: 3 - download_service.cc:302
20241124 11:21:38.991551 UTC 69532246 INFO  nvidia-smi is not available! - system_info_utils.h:130
20241124 11:21:38.994464 UTC 69532246 INFO  Server started, listening at: 127.0.0.1:39281 - main.cc:140
20241124 11:21:38.994479 UTC 69532246 INFO  Please load your model - main.cc:142
20241124 11:21:38.994494 UTC 69532246 INFO  Number of thread is:10 - main.cc:149
20241124 11:21:39.946448 UTC 69532269 INFO  Origin:  - main.cc:162
20241124 11:21:39.959655 UTC 69532270 INFO  {
	"ai_prompt" : "[/INST]",
	"ai_template" : "[/INST]",
	"created" : 0,
	"ctx_len" : 512,
	"dynatemp_exponent" : 1.0,
	"dynatemp_range" : 0.0,
	"engine" : "llama-cpp",
	"files" :
	[
		"models/huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF/snowflake-arctic-embed-m-F16.gguf"
	],
	"frequency_penalty" : 0.0,
	"gpu_arch" : "",
	"ignore_eos" : false,
	"max_tokens" : 512,
	"min_keep" : 0,
	"min_p" : 0.05000000074505806,
	"mirostat" : false,
	"mirostat_eta" : 0.10000000149011612,
	"mirostat_tau" : 5.0,
	"model" : "yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf",
	"model_path" : "/Users/grzegorzbielski/cortexcpp/models/huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF/snowflake-arctic-embed-m-F16.gguf",
	"n_parallel" : 1,
	"n_probs" : 0,
	"name" : "Snowflake-Arctic-Embed-M",
	"ngl" : 13,
	"object" : "",
	"os" : "",
	"owned_by" : "",
	"penalize_nl" : false,
	"precision" : "",
	"presence_penalty" : 0.0,
	"prompt_template" : "[INST] <<SYS>>\n{system_message}\n<</SYS>>\n{prompt}[/INST]",
	"quantization_method" : "",
	"repeat_last_n" : 64,
	"repeat_penalty" : 1.0,
	"seed" : -1,
	"size" : 0,
	"stop" :
	[
		"[PAD]"
	],
	"stream" : true,
	"system_prompt" : "[INST] <<SYS>>\n",
	"system_template" : "[INST] <<SYS>>\n",
	"temperature" : 0.69999998807907104,
	"text_model" : false,
	"tfs_z" : 1.0,
	"top_k" : 40,
	"top_p" : 0.94999998807907104,
	"typ_p" : 1.0,
	"user_prompt" : "\n<</SYS>>\n",
	"user_template" : "\n<</SYS>>\n",
	"version" : "2"
}
 - model_service.cc:667
20241124 11:21:39.974798 UTC 69532270 INFO  nvidia-smi is not available! - system_info_utils.h:130
20241124 11:21:39.976185 UTC 69532270 INFO  is_cuda: 0 - model_service.cc:680
20241124 11:21:39.976265 UTC 69532270 INFO  Loading engine: cortex.llamacpp - engine_service.cc:771
20241124 11:21:39.977503 UTC 69532270 INFO  Selected engine variant: {"engine":"cortex.llamacpp","variant":"mac-arm64","version":"v0.1.39"} - engine_service.cc:780
20241124 11:21:39.978785 UTC 69532270 INFO  Engine path: /Users/grzegorzbielski/cortexcpp/engines/cortex.llamacpp/mac-arm64/v0.1.39 - engine_service.cc:805
20241124 11:21:39.983962 UTC 69532270 INFO  cortex.llamacpp version: 0.1.39 - llama_engine.cc:308
20241124 11:21:39.984439 UTC 69532270 INFO  Number of parallel is set to 1 - llama_engine.cc:544
20241124 11:21:39.984490 UTC 69532270 INFO  system info: {'n_thread': 10, 'total_threads': 10. 'system_info': 'AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | '} - llama_engine.cc:620
20241124 11:21:40.052148 UTC 69532270 INFO  llama_load_model_from_file: using device Metal (Apple M1 Max) - 21845 MiB free
 - llama_engine.cc:475
20241124 11:21:40.055512 UTC 69532270 INFO  llama_model_loader: loaded meta data with 26 key-value pairs and 197 tensors from /Users/grzegorzbielski/cortexcpp/models/huggingface.co/yixuan-chia/snowflake-arctic-embed-m-GGUF/snowflake-arctic-embed-m-F16.gguf (version GGUF V3 (latest))
 - llama_engine.cc:475
20241124 11:21:40.055544 UTC 69532270 INFO  llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
 - llama_engine.cc:475
20241124 11:21:40.055562 UTC 69532270 INFO  llama_model_loader: - kv   0:                       general.architecture str              = bert
 - llama_engine.cc:475
20241124 11:21:40.055577 UTC 69532270 INFO  llama_model_loader: - kv   1:                               general.type str              = model
 - llama_engine.cc:475
20241124 11:21:40.055592 UTC 69532270 INFO  llama_model_loader: - kv   2:                               general.name str              = Snowflake Arctic Embed M
 - llama_engine.cc:475
20241124 11:21:40.055607 UTC 69532270 INFO  llama_model_loader: - kv   3:                         general.size_label str              = 109M
 - llama_engine.cc:475
20241124 11:21:40.055621 UTC 69532270 INFO  llama_model_loader: - kv   4:                            general.license str              = apache-2.0
 - llama_engine.cc:475
20241124 11:21:40.055640 UTC 69532270 INFO  llama_model_loader: - kv   5:                               general.tags arr[str,8]       = ["sentence-transformers", "feature-ex...
 - llama_engine.cc:475
20241124 11:21:40.055655 UTC 69532270 INFO  llama_model_loader: - kv   6:                           bert.block_count u32              = 12
 - llama_engine.cc:475
20241124 11:21:40.055669 UTC 69532270 INFO  llama_model_loader: - kv   7:                        bert.context_length u32              = 512
 - llama_engine.cc:475
20241124 11:21:40.055683 UTC 69532270 INFO  llama_model_loader: - kv   8:                      bert.embedding_length u32              = 768
 - llama_engine.cc:475
20241124 11:21:40.055697 UTC 69532270 INFO  llama_model_loader: - kv   9:                   bert.feed_forward_length u32              = 3072
 - llama_engine.cc:475
20241124 11:21:40.055712 UTC 69532270 INFO  llama_model_loader: - kv  10:                  bert.attention.head_count u32              = 12
 - llama_engine.cc:475
20241124 11:21:40.055728 UTC 69532270 INFO  llama_model_loader: - kv  11:          bert.attention.layer_norm_epsilon f32              = 0.000000
 - llama_engine.cc:475
20241124 11:21:40.055746 UTC 69532270 INFO  llama_model_loader: - kv  12:                          general.file_type u32              = 1
 - llama_engine.cc:475
20241124 11:21:40.055761 UTC 69532270 INFO  llama_model_loader: - kv  13:                      bert.attention.causal bool             = false
 - llama_engine.cc:475
20241124 11:21:40.055776 UTC 69532270 INFO  llama_model_loader: - kv  14:                          bert.pooling_type u32              = 2
 - llama_engine.cc:475
20241124 11:21:40.055792 UTC 69532270 INFO  llama_model_loader: - kv  15:            tokenizer.ggml.token_type_count u32              = 2
 - llama_engine.cc:475
20241124 11:21:40.055808 UTC 69532270 INFO  llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = bert
 - llama_engine.cc:475
20241124 11:21:40.055823 UTC 69532270 INFO  llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = jina-v2-en
 - llama_engine.cc:475
20241124 11:21:40.060101 UTC 69532270 INFO  llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
 - llama_engine.cc:475
20241124 11:21:40.061260 UTC 69532270 INFO  llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
 - llama_engine.cc:475
20241124 11:21:40.061277 UTC 69532270 INFO  llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 100
 - llama_engine.cc:475
20241124 11:21:40.061291 UTC 69532270 INFO  llama_model_loader: - kv  21:          tokenizer.ggml.seperator_token_id u32              = 102
 - llama_engine.cc:475
20241124 11:21:40.061304 UTC 69532270 INFO  llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 0
 - llama_engine.cc:475
20241124 11:21:40.061318 UTC 69532270 INFO  llama_model_loader: - kv  23:                tokenizer.ggml.cls_token_id u32              = 101
 - llama_engine.cc:475
20241124 11:21:40.061332 UTC 69532270 INFO  llama_model_loader: - kv  24:               tokenizer.ggml.mask_token_id u32              = 103
 - llama_engine.cc:475
20241124 11:21:40.061346 UTC 69532270 INFO  llama_model_loader: - kv  25:               general.quantization_version u32              = 2
 - llama_engine.cc:475
20241124 11:21:40.061360 UTC 69532270 INFO  llama_model_loader: - type  f32:  124 tensors
 - llama_engine.cc:475
20241124 11:21:40.061373 UTC 69532270 INFO  llama_model_loader: - type  f16:   73 tensors
 - llama_engine.cc:475
20241124 11:21:40.065239 UTC 69532270 INFO  llm_load_vocab: special tokens cache size = 5
 - llama_engine.cc:475
20241124 11:21:40.067499 UTC 69532270 INFO  llm_load_vocab: token to piece cache size = 0.2032 MB
 - llama_engine.cc:475
20241124 11:21:40.067551 UTC 69532270 INFO  llm_load_print_meta: format           = GGUF V3 (latest)
 - llama_engine.cc:475
20241124 11:21:40.067567 UTC 69532270 INFO  llm_load_print_meta: arch             = bert
 - llama_engine.cc:475
20241124 11:21:40.067585 UTC 69532270 INFO  llm_load_print_meta: vocab type       = WPM
 - llama_engine.cc:475
20241124 11:21:40.067598 UTC 69532270 INFO  llm_load_print_meta: n_vocab          = 30522
 - llama_engine.cc:475
20241124 11:21:40.067612 UTC 69532270 INFO  llm_load_print_meta: n_merges         = 0
 - llama_engine.cc:475
20241124 11:21:40.067625 UTC 69532270 INFO  llm_load_print_meta: vocab_only       = 0
 - llama_engine.cc:475
20241124 11:21:40.067638 UTC 69532270 INFO  llm_load_print_meta: n_ctx_train      = 512
 - llama_engine.cc:475
20241124 11:21:40.067653 UTC 69532270 INFO  llm_load_print_meta: n_embd           = 768
 - llama_engine.cc:475
20241124 11:21:40.067667 UTC 69532270 INFO  llm_load_print_meta: n_layer          = 12
 - llama_engine.cc:475
20241124 11:21:40.067685 UTC 69532270 INFO  llm_load_print_meta: n_head           = 12
 - llama_engine.cc:475
20241124 11:21:40.067700 UTC 69532270 INFO  llm_load_print_meta: n_head_kv        = 12
 - llama_engine.cc:475
20241124 11:21:40.067714 UTC 69532270 INFO  llm_load_print_meta: n_rot            = 64
 - llama_engine.cc:475
20241124 11:21:40.067728 UTC 69532270 INFO  llm_load_print_meta: n_swa            = 0
 - llama_engine.cc:475
20241124 11:21:40.067741 UTC 69532270 INFO  llm_load_print_meta: n_embd_head_k    = 64
 - llama_engine.cc:475
20241124 11:21:40.067755 UTC 69532270 INFO  llm_load_print_meta: n_embd_head_v    = 64
 - llama_engine.cc:475
20241124 11:21:40.067769 UTC 69532270 INFO  llm_load_print_meta: n_gqa            = 1
 - llama_engine.cc:475
20241124 11:21:40.067784 UTC 69532270 INFO  llm_load_print_meta: n_embd_k_gqa     = 768
 - llama_engine.cc:475
20241124 11:21:40.067797 UTC 69532270 INFO  llm_load_print_meta: n_embd_v_gqa     = 768
 - llama_engine.cc:475
20241124 11:21:40.067810 UTC 69532270 INFO  llm_load_print_meta: f_norm_eps       = 1.0e-12
 - llama_engine.cc:475
20241124 11:21:40.067822 UTC 69532270 INFO  llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
 - llama_engine.cc:475
20241124 11:21:40.067834 UTC 69532270 INFO  llm_load_print_meta: f_clamp_kqv      = 0.0e+00
 - llama_engine.cc:475
20241124 11:21:40.067846 UTC 69532270 INFO  llm_load_print_meta: f_max_alibi_bias = 0.0e+00
 - llama_engine.cc:475
20241124 11:21:40.067863 UTC 69532270 INFO  llm_load_print_meta: f_logit_scale    = 0.0e+00
 - llama_engine.cc:475
20241124 11:21:40.067876 UTC 69532270 INFO  llm_load_print_meta: n_ff             = 3072
 - llama_engine.cc:475
20241124 11:21:40.067889 UTC 69532270 INFO  llm_load_print_meta: n_expert         = 0
 - llama_engine.cc:475
20241124 11:21:40.067903 UTC 69532270 INFO  llm_load_print_meta: n_expert_used    = 0
 - llama_engine.cc:475
20241124 11:21:40.067917 UTC 69532270 INFO  llm_load_print_meta: causal attn      = 0
 - llama_engine.cc:475
20241124 11:21:40.067931 UTC 69532270 INFO  llm_load_print_meta: pooling type     = 2
 - llama_engine.cc:475
20241124 11:21:40.067945 UTC 69532270 INFO  llm_load_print_meta: rope type        = 2
 - llama_engine.cc:475
20241124 11:21:40.067958 UTC 69532270 INFO  llm_load_print_meta: rope scaling     = linear
 - llama_engine.cc:475
20241124 11:21:40.067972 UTC 69532270 INFO  llm_load_print_meta: freq_base_train  = 10000.0
 - llama_engine.cc:475
20241124 11:21:40.067986 UTC 69532270 INFO  llm_load_print_meta: freq_scale_train = 1
 - llama_engine.cc:475
20241124 11:21:40.068000 UTC 69532270 INFO  llm_load_print_meta: n_ctx_orig_yarn  = 512
 - llama_engine.cc:475
20241124 11:21:40.068014 UTC 69532270 INFO  llm_load_print_meta: rope_finetuned   = unknown
 - llama_engine.cc:475
20241124 11:21:40.068027 UTC 69532270 INFO  llm_load_print_meta: ssm_d_conv       = 0
 - llama_engine.cc:475
20241124 11:21:40.068039 UTC 69532270 INFO  llm_load_print_meta: ssm_d_inner      = 0
 - llama_engine.cc:475
20241124 11:21:40.068051 UTC 69532270 INFO  llm_load_print_meta: ssm_d_state      = 0
 - llama_engine.cc:475
20241124 11:21:40.068063 UTC 69532270 INFO  llm_load_print_meta: ssm_dt_rank      = 0
 - llama_engine.cc:475
20241124 11:21:40.068075 UTC 69532270 INFO  llm_load_print_meta: ssm_dt_b_c_rms   = 0
 - llama_engine.cc:475
20241124 11:21:40.068090 UTC 69532270 INFO  llm_load_print_meta: model type       = 109M
 - llama_engine.cc:475
20241124 11:21:40.068108 UTC 69532270 INFO  llm_load_print_meta: model ftype      = F16
 - llama_engine.cc:475
20241124 11:21:40.068120 UTC 69532270 INFO  llm_load_print_meta: model params     = 108.89 M
 - llama_engine.cc:475
20241124 11:21:40.068136 UTC 69532270 INFO  llm_load_print_meta: model size       = 208.68 MiB (16.08 BPW)
 - llama_engine.cc:475
20241124 11:21:40.068149 UTC 69532270 INFO  llm_load_print_meta: general.name     = Snowflake Arctic Embed M
 - llama_engine.cc:475
20241124 11:21:40.068162 UTC 69532270 INFO  llm_load_print_meta: UNK token        = 100 '[UNK]'
 - llama_engine.cc:475
20241124 11:21:40.068175 UTC 69532270 INFO  llm_load_print_meta: SEP token        = 102 '[SEP]'
 - llama_engine.cc:475
20241124 11:21:40.068188 UTC 69532270 INFO  llm_load_print_meta: PAD token        = 0 '[PAD]'
 - llama_engine.cc:475
20241124 11:21:40.068202 UTC 69532270 INFO  llm_load_print_meta: CLS token        = 101 '[CLS]'
 - llama_engine.cc:475
20241124 11:21:40.068215 UTC 69532270 INFO  llm_load_print_meta: MASK token       = 103 '[MASK]'
 - llama_engine.cc:475
20241124 11:21:40.068228 UTC 69532270 INFO  llm_load_print_meta: LF token         = 0 '[PAD]'
 - llama_engine.cc:475
20241124 11:21:40.068242 UTC 69532270 INFO  llm_load_print_meta: max token length = 21
 - llama_engine.cc:475
20241124 11:21:40.070025 UTC 69532270 INFO  llm_load_tensors: offloading 12 repeating layers to GPU
 - llama_engine.cc:475
20241124 11:21:40.070056 UTC 69532270 INFO  llm_load_tensors: offloading output layer to GPU
 - llama_engine.cc:475
20241124 11:21:40.070071 UTC 69532270 INFO  llm_load_tensors: offloaded 13/13 layers to GPU
 - llama_engine.cc:475
20241124 11:21:40.070087 UTC 69532270 INFO  llm_load_tensors: Metal_Mapped model buffer size =   162.46 MiB
 - llama_engine.cc:475
20241124 11:21:40.070101 UTC 69532270 INFO  llm_load_tensors:   CPU_Mapped model buffer size =    46.22 MiB
 - llama_engine.cc:475
[... repeated "INFO  . - llama_engine.cc:475" progress lines omitted ...]
20241124 11:21:40.070653 UTC 69532270 INFO
 - llama_engine.cc:475
20241124 11:21:40.071012 UTC 69532270 INFO  llama_new_context_with_model: n_seq_max     = 1
 - llama_engine.cc:475
20241124 11:21:40.071023 UTC 69532270 INFO  llama_new_context_with_model: n_ctx         = 512
 - llama_engine.cc:475
20241124 11:21:40.071036 UTC 69532270 INFO  llama_new_context_with_model: n_ctx_per_seq = 512
 - llama_engine.cc:475
20241124 11:21:40.071046 UTC 69532270 INFO  llama_new_context_with_model: n_batch       = 2048
 - llama_engine.cc:475
20241124 11:21:40.071056 UTC 69532270 INFO  llama_new_context_with_model: n_ubatch      = 2048
 - llama_engine.cc:475
20241124 11:21:40.071066 UTC 69532270 INFO  llama_new_context_with_model: flash_attn    = 1
 - llama_engine.cc:475
20241124 11:21:40.071076 UTC 69532270 INFO  llama_new_context_with_model: freq_base     = 10000.0
 - llama_engine.cc:475
20241124 11:21:40.071087 UTC 69532270 INFO  llama_new_context_with_model: freq_scale    = 1
 - llama_engine.cc:475
20241124 11:21:40.071097 UTC 69532270 INFO  ggml_metal_init: allocating
 - llama_engine.cc:475
20241124 11:21:40.071112 UTC 69532270 INFO  ggml_metal_init: found device: Apple M1 Max
 - llama_engine.cc:475
20241124 11:21:40.071128 UTC 69532270 INFO  ggml_metal_init: picking default device: Apple M1 Max
 - llama_engine.cc:475
20241124 11:21:40.071997 UTC 69532270 INFO  ggml_metal_init: using embedded metal library
 - llama_engine.cc:475
20241124 11:21:40.076377 UTC 69532270 INFO  ggml_metal_init: GPU name:   Apple M1 Max
 - llama_engine.cc:475
20241124 11:21:40.076396 UTC 69532270 INFO  ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
 - llama_engine.cc:475
20241124 11:21:40.076409 UTC 69532270 INFO  ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
 - llama_engine.cc:475
20241124 11:21:40.076421 UTC 69532270 INFO  ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
 - llama_engine.cc:475
20241124 11:21:40.076432 UTC 69532270 INFO  ggml_metal_init: simdgroup reduction   = true
 - llama_engine.cc:475
20241124 11:21:40.076443 UTC 69532270 INFO  ggml_metal_init: simdgroup matrix mul. = true
 - llama_engine.cc:475
20241124 11:21:40.076455 UTC 69532270 INFO  ggml_metal_init: has bfloat            = true
 - llama_engine.cc:475
20241124 11:21:40.076466 UTC 69532270 INFO  ggml_metal_init: use bfloat            = false
 - llama_engine.cc:475
20241124 11:21:40.076477 UTC 69532270 INFO  ggml_metal_init: hasUnifiedMemory      = true
 - llama_engine.cc:475
20241124 11:21:40.076489 UTC 69532270 INFO  ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
 - llama_engine.cc:475
20241124 11:21:40.078759 UTC 69532270 WARN  ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
 - llama_engine.cc:473
20241124 11:21:40.079510 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
 - llama_engine.cc:473
20241124 11:21:40.079530 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
 - llama_engine.cc:473
20241124 11:21:40.079540 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
 - llama_engine.cc:473
20241124 11:21:40.079550 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
 - llama_engine.cc:473
20241124 11:21:40.080422 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
 - llama_engine.cc:473
20241124 11:21:40.081204 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
 - llama_engine.cc:473
20241124 11:21:40.081922 UTC 69532270 WARN  ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083301 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083315 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083324 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083334 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083344 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
 - llama_engine.cc:473
20241124 11:21:40.083354 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
 - llama_engine.cc:473
20241124 11:21:40.084852 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
 - llama_engine.cc:473
20241124 11:21:40.085091 UTC 69532270 WARN  ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
 - llama_engine.cc:473
20241124 11:21:40.085347 UTC 69532270 WARN  ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
 - llama_engine.cc:473
20241124 11:21:40.085418 UTC 69532270 WARN  ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
 - llama_engine.cc:473
20241124 11:21:40.085429 UTC 69532270 WARN  ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
 - llama_engine.cc:473
20241124 11:21:40.087354 UTC 69532270 INFO  llama_kv_cache_init:      Metal KV buffer size =    18.00 MiB
 - llama_engine.cc:475
20241124 11:21:40.087379 UTC 69532270 INFO  llama_new_context_with_model: KV self size  =   18.00 MiB, K (f16):    9.00 MiB, V (f16):    9.00 MiB
 - llama_engine.cc:475
20241124 11:21:40.087400 UTC 69532270 INFO  llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
 - llama_engine.cc:475
20241124 11:21:40.088241 UTC 69532270 INFO  llama_new_context_with_model:      Metal compute buffer size =    19.50 MiB
 - llama_engine.cc:475
20241124 11:21:40.088261 UTC 69532270 INFO  llama_new_context_with_model:        CPU compute buffer size =     4.00 MiB
 - llama_engine.cc:475
20241124 11:21:40.088270 UTC 69532270 INFO  llama_new_context_with_model: graph nodes  = 429
 - llama_engine.cc:475
20241124 11:21:40.088279 UTC 69532270 INFO  llama_new_context_with_model: graph splits = 2
 - llama_engine.cc:475
/Users/runner/work/cortex.llamacpp/cortex.llamacpp/llama.cpp/src/llama.cpp:17453: GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor") failed

What is your OS?

- [ ] Windows
- [x] Mac Silicon
- [ ] Mac Intel
- [ ] Linux / Ubuntu

What engine are you running?

- [x] cortex.llamacpp (default)
- [ ] cortex.tensorrt-llm (Nvidia GPUs)
- [ ] cortex.onnx (NPUs, DirectML)

Hardware Specs eg OS version, GPU

Apple M1 Max, Sonoma 14.7

@grzegorz-bielski grzegorz-bielski added the type: bug Something isn't working label Nov 24, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Jan & Cortex Nov 24, 2024
louis-jan (Contributor) commented:

Ah, embedding models require additional model-load parameters. This is supported via the API but not the CLI yet. To be updated.
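
For reference, that matches the assert at the end of the log: a BERT-architecture embedding model ends its graph in a pooled-embedding tensor rather than the `result_output` logits tensor llama.cpp expects when the context is not created in embedding mode. A rough sketch of the API route (the endpoint shape and the `embedding` flag below are assumptions based on the comment above, not confirmed documentation):

# Hypothetical sketch: load the model through the server API with an
# embedding flag instead of `cortex models start`. The `embedding`
# field is an assumption, not a documented parameter.
curl http://127.0.0.1:39281/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yixuan-chia:snowflake-arctic-embed-m-GGUF:snowflake-arctic-embed-m-F16.gguf",
    "embedding": true
  }'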

@louis-jan louis-jan added the category: tools RAG, function calling, etc label Dec 4, 2024