Certain characters crash bitnet model inference? #102

Open
grctest opened this issue Nov 1, 2024 · 6 comments

@grctest

grctest commented Nov 1, 2024

I've been working on securing user input by escaping invalid characters. However, I've encountered a few prompts that cause llama-cli to halt abruptly:

.\llama-cli.exe --model "..\..\..\models\Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf" --prompt "£" --threads 2 -c 2048 -n 20 -ngl 0 --temp 0.8
...
system_info: n_threads = 2 (n_threads_batch = 2) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

The command abruptly halts at system_info, offering no further logs.

This also occurs for --prompt "¬" and some other non-ASCII Unicode characters.

By contrast, it works for plain ASCII characters:

.\llama-cli.exe --model "..\..\..\models\Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf" --prompt "a" --threads 2 -c 2048 -n 20 -ngl 0 --temp 0.8
...
system_info: n_threads = 2 (n_threads_batch = 2) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 3436479236
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = 20, n_keep = 1

a, or first- 1,000 days of life, it's not that the first 1

llama_perf_sampler_print:    sampling time =       2.02 ms /    22 runs   (    0.09 ms per token, 10918.11 tokens per second)
llama_perf_context_print:        load time =    1092.28 ms
llama_perf_context_print: prompt eval time =     133.19 ms /     2 tokens (   66.60 ms per token,    15.02 tokens per second)
llama_perf_context_print:        eval time =    1831.67 ms /    19 runs   (   96.40 ms per token,    10.37 tokens per second)
llama_perf_context_print:       total time =    1970.46 ms /    21 tokens

Is this because these Unicode characters are missing from the model's tokenizer vocabulary? Or is there a Unicode fix in llama.cpp for this issue?

I could probably filter the incompatible characters out of the user's prompt if this cannot be worked around. Is there a list of incompatible characters?
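A minimal sketch of that filtering idea, assuming the crudest possible policy (keep printable ASCII plus newline and tab). `sanitize_prompt` is a hypothetical helper, not part of BitNet or llama.cpp, and it drops characters such as £ rather than making them work:

```python
def sanitize_prompt(prompt: str, replacement: str = "") -> str:
    """Keep printable ASCII (plus newline and tab); replace everything else."""
    allowed_controls = {"\n", "\t"}
    return "".join(
        ch if (32 <= ord(ch) < 127 or ch in allowed_controls) else replacement
        for ch in prompt
    )


print(sanitize_prompt("£5 a day"))
print(sanitize_prompt("a¬b", replacement="?"))
```

This is only a stopgap: it changes the meaning of prompts that legitimately contain non-ASCII text.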

Thanks

@kth8

kth8 commented Nov 7, 2024

I tried your characters using a Docker container and didn't encounter any errors.

$ docker run --rm ghcr.io/kth8/bitnet build/bin/llama-cli --model Llama3-8B-1.58-100B-tokens-TQ2_0.gguf --prompt "£"
...
system_info: n_threads = 4 (n_threads_batch = 4) / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

sampler seed: 1941399586
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1

£5.5m, 1995 – 2014
- 2015 – 2025This is a new page, I have just begun writing for it. I have no plans to have it completed, but I would like to add to it. I will also add to other pages.
The following is a summary of what I am trying to do with the page.
This page is intended to be a summary of what we are doing to reduce the use of fossil fuels.
We are doing this by making the use of the fuels available to us, but not being able to do this, by reducing the use of the fossil fuels, we are able to do it. But the use of the fossil fuels is still too much.
We are still in the process of finding ways to reduce the use of the fossil fuels, and in the process of finding ways to reduce the use of the fossil fuels, we are still in the process of finding ways to reduce the use of the fossil fuels.
I will not be able to complete this page, but I would like to add to it.

llama_perf_sampler_print:    sampling time =      44.89 ms /   287 runs   (    0.16 ms per token,  6393.98 tokens per second)
llama_perf_context_print:        load time =    1439.78 ms
llama_perf_context_print: prompt eval time =     155.36 ms /     2 tokens (   77.68 ms per token,    12.87 tokens per second)
llama_perf_context_print:        eval time =   34984.67 ms /   284 runs   (  123.19 ms per token,     8.12 tokens per second)
llama_perf_context_print:       total time =   35372.84 ms /   286 tokens

@grctest
Author

grctest commented Nov 7, 2024

@kth8 Thanks for testing this out. I notice you used Llama3-8B-1.58-100B-tokens-TQ2_0.gguf, whereas I used Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf. How did you get the TQ2_0 model instead of the i2_s one? Did you use --quant-type tl2 during the setup phase?

I re-ran the setup script, and now I at least get the following error message:

Error occurred while running command: Command '['build\\bin\\Release\\llama-cli.exe', '-m', 'models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf', '-n', '6', '-t', '2', '-p', '£', '-ngl', '0', '-c', '2048', '--temp', '0.0', '-b', '1']' returned non-zero exit status 3221226505.
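The decimal exit status is easier to look up when converted to a Windows NTSTATUS hex code. The sketch below does only that conversion; `to_ntstatus_hex` is a hypothetical helper name. The resulting value, 0xC0000409, is STATUS_STACK_BUFFER_OVERRUN, which Windows also commonly reports for fast-fail terminations such as an abort(), so it does not by itself prove a stack overrun:

```python
def to_ntstatus_hex(exit_status: int) -> str:
    """Format a process exit status as a 32-bit unsigned hex code."""
    # Masking also handles the case where the status is reported as a
    # negative signed 32-bit integer.
    return f"0x{exit_status & 0xFFFFFFFF:08X}"


print(to_ntstatus_hex(3221226505))
```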

I'm downloading the other two reference 1-bit models and will test the same issue with them.

@kth8

kth8 commented Nov 7, 2024

Code is in my repo if you want to take a look: https://github.com/kth8/bitnet

@grctest
Author

grctest commented Nov 7, 2024

The model build README lists only {i2_s,tl1} as options for quant_type, not tl2. However, you're right that setup_env.py supports tl2 too, so I tried the following:

python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q tl2

Then running the command:
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-tl2.gguf -p "£"

Results in:

warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from models/Llama3-8B-1.58-100B-tokens/ggml-model-tl2.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-8B-1.58-100B-tokens
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 39
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:    2 tensors
llama_model_loader: - type  tl2:  224 tensors
...
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = TL2
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 3.33 GiB (3.56 BPW)
llm_load_print_meta: general.name     = Llama3-8B-1.58-100B-tokens
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3405.69 MiB
............................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 32
llama_new_context_with_model: n_ubatch   = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =    16.16 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2

Error occurred while running command: Command '['build\\bin\\Release\\llama-cli.exe', '-m', 'models/Llama3-8B-1.58-100B-tokens/ggml-model-tl2.gguf', '-n', '6', '-t', '2', '-p', '£', '-ngl', '0', '-c', '2048', '--temp', '0.0', '-b', '1']' returned non-zero exit status 3221226505.

By contrast, trying out the model "bitnet_b1_58-3B\ggml-model-i2_s.gguf" results in:

python run_inference.py -m models\bitnet_b1_58-3B\ggml-model-i2_s.gguf -p "£" -n 20 -temp 0.8
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 288 tensors from models\bitnet_b1_58-3B\ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bitnet
llama_model_loader: - kv   1:                               general.name str              = bitnet_b1_58-3B
llama_model_loader: - kv   2:                         bitnet.block_count u32              = 26
llama_model_loader: - kv   3:                      bitnet.context_length u32              = 2048
llama_model_loader: - kv   4:                    bitnet.embedding_length u32              = 3200
llama_model_loader: - kv   5:                 bitnet.feed_forward_length u32              = 8640
llama_model_loader: - kv   6:                bitnet.attention.head_count u32              = 32
llama_model_loader: - kv   7:             bitnet.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:                      bitnet.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9:    bitnet.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 40
llama_model_loader: - kv  11:                          bitnet.vocab_size u32              = 32002
llama_model_loader: - kv  12:                   bitnet.rope.scaling.type str              = linear
llama_model_loader: - kv  13:                 bitnet.rope.scaling.factor f32              = 1.000000
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type i2_s:  182 tensors
llm_load_vocab: control token:      2 '</s>' is not marked as EOG
llm_load_vocab: control token:      1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bitnet
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 3200
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 100
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 100
llm_load_print_meta: n_embd_head_v    = 100
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3200
llm_load_print_meta: n_embd_v_gqa     = 3200
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8640
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = I2_S - 2 bpw ternary
llm_load_print_meta: model params     = 3.32 B
llm_load_print_meta: model size       = 873.66 MiB (2.20 BPW)
llm_load_print_meta: general.name     = bitnet_b1_58-3B
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '</line>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.13 MiB
llm_load_tensors:        CPU buffer size =   873.66 MiB
..........................................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 32
llama_new_context_with_model: n_ubatch   = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   650.00 MiB
llama_new_context_with_model: KV self size  =  650.00 MiB, K (f16):  325.00 MiB, V (f16):  325.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =     9.81 MiB
llama_new_context_with_model: graph nodes  = 942
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2

system_info: n_threads = 2 (n_threads_batch = 2) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 1279925761
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 20, n_keep = 1

 � < ch and or under,. to on,b this and j


 g. good

llama_perf_sampler_print:    sampling time =       0.63 ms /    23 runs   (    0.03 ms per token, 36392.41 tokens per second)
llama_perf_context_print:        load time =     361.70 ms
llama_perf_context_print: prompt eval time =     111.17 ms /     3 tokens (   37.06 ms per token,    26.99 tokens per second)
llama_perf_context_print:        eval time =     673.42 ms /    19 runs   (   35.44 ms per token,    28.21 tokens per second)
llama_perf_context_print:       total time =     787.18 ms /    22 tokens

So whilst it works, it outputs � instead of £.
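One possible explanation (an unconfirmed assumption, not something verified against the llama.cpp source) is that on Windows the prompt reaches the tokenizer as a single legacy code-page byte rather than as UTF-8. The sketch below only illustrates the byte-level difference in Python, assuming Windows-1252 as the legacy code page:

```python
pound = "£"

utf8_bytes = pound.encode("utf-8")     # two bytes: C2 A3
legacy_bytes = pound.encode("cp1252")  # one byte: A3 (assumed legacy code page)

print(utf8_bytes.hex(), legacy_bytes.hex())

# A lone A3 byte is not valid UTF-8, so a strict decoder rejects it ...
try:
    legacy_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... while a lenient decoder substitutes U+FFFD, which renders as "�".
lenient = legacy_bytes.decode("utf-8", errors="replace")
print(strict_ok, lenient == "\ufffd")
```

If that is what is happening, a tokenizer with byte fallback could pass the stray byte through and print �, while a stricter tokenizer could reject the input outright. That would fit the different behaviour of the two models here, but it remains a guess.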

@kth8 Your linked Hugging Face instructions say to run inference with llama-cli obtained via git clone https://github.com/ggerganov/llama.cpp or brew install llama.cpp. However, the BitNet repo clones this fork into its 3rdparty folder: https://github.com/Eddie-Wang1120/llama.cpp/tree/406a5036f9a8aaee9ec5e96e652f61691340fe95

In your Dockerfile you recursively git clone this repo, so you're also using the forked 3rdparty repo rather than the latest llama.cpp as instructed in your Hugging Face model README. Since your Docker build doesn't run into the inference issue, perhaps the problem is introduced while building the GGUF file (via setup_env.py) with the 3rdparty repo, rather than during inference?

@kth8

kth8 commented Nov 7, 2024

The Hugging Face repo is not mine. I just linked to it as reference for where I got the GGUF model file.

@grctest
Author

grctest commented Nov 11, 2024

I've been trying out BitNet on Debian via WSL2 and Docker, and I can confirm that this issue does not occur in that environment, so it looks like a Windows-only issue. Here's my working environment: #110
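If the problem really is limited to how non-ASCII command-line arguments are passed on Windows (an assumption), one untested workaround is to write the prompt to a UTF-8 file and pass it with llama-cli's -f/--file option instead of -p. In this sketch `build_command` is a hypothetical helper, and the paths are the ones used earlier in this thread:

```python
import tempfile
from pathlib import Path


def build_command(cli_path: str, model_path: str, prompt: str) -> list[str]:
    """Write the prompt to a UTF-8 file and return a llama-cli argument list."""
    prompt_file = Path(tempfile.mkdtemp()) / "prompt.txt"
    prompt_file.write_text(prompt, encoding="utf-8")
    # -f reads the prompt from a file, so the prompt text never passes
    # through the process's command-line arguments.
    return [cli_path, "-m", model_path, "-f", str(prompt_file),
            "-n", "20", "-ngl", "0"]


cmd = build_command(
    r"build\bin\Release\llama-cli.exe",
    "models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf",
    "£",
)
print(cmd)  # hand this list to subprocess.run(cmd, check=True) to launch it
```

Whether this avoids the crash has not been tested here; it only removes the command-line argument from the path the prompt takes.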
