
bug: Wrong maximum context length for qwen2.5-coder #3714

Closed
1 of 3 tasks
alexbfr opened this issue Sep 21, 2024 · 2 comments · Fixed by #3725
Labels: type: bug (Something isn't working)

Comments


alexbfr commented Sep 21, 2024

Jan version

v0.5.4

Describe the Bug

Using qwen-2.5-coder-7b-instruct, Jan allows a maximum context length of only 2048 tokens. However, according to both Qwen's website and llama.cpp's output (most recent version from GitHub at the time of writing), the maximum context length is 131072 tokens.

build: 3787 (6026da52) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 47, n_threads_batch = 47, total_threads = 48

system_info: n_threads = 47 (n_threads_batch = 47) / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 47
main: loading model
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from [...]/qwen-2.5/qwen2.5-coder-7b-instruct-q8_0-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 7B Instruct GGUF
llama_model_loader: - kv   3:                           general.finetune str              = Instruct-GGUF
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 7
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 3
llama_model_loader: - kv  28:                        split.tensors.count i32              = 339
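
To cross-check outside of Jan, the advertised context length can be read straight from the GGUF header. A minimal sketch, assuming the gguf Python package that ships with llama.cpp's gguf-py (pip install gguf); the file name matches the first split loaded above:

from gguf import GGUFReader  # assumption: llama.cpp's gguf-py package

reader = GGUFReader("qwen2.5-coder-7b-instruct-q8_0-00001-of-00003.gguf")
field = reader.fields["qwen2.context_length"]
# For a scalar field, field.data holds the index of the part containing the value
print(int(field.parts[field.data[0]][0]))  # expected: 131072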

Unfortunately, this makes the model all but unusable within Jan for coding-related tasks.

Steps to Reproduce

  1. Install the current Jan release (0.5.4) as a Debian package
  2. Download qwen-2.5-coder-7b-instruct q8_0 (https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/tree/main)
  3. Import the model into Jan
  4. Select the model in the "Model" tab
  5. Scroll down to "Context Length" in the "Model" tab
  6. Observe that 2048 is the maximum allowed value

To my understanding, even without RoPE scaling, 2048 is far below the trained context length of qwen-2.5-coder.
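
As an additional sanity check that the backend itself accepts a larger window (a hypothetical sketch, independent of Jan's UI: it assumes a standalone llama-server started with "-c 32768" on port 8080, and queries its /props endpoint, which reports the effective settings):

import json
import urllib.request

# assumption: llama-server -m qwen2.5-coder-7b-instruct-q8_0-00001-of-00003.gguf -c 32768
with urllib.request.urlopen("http://127.0.0.1:8080/props") as resp:
    props = json.load(resp)

# n_ctx is the context size llama.cpp actually allocated
print(props["default_generation_settings"]["n_ctx"])  # expect 32768, not 2048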

Screenshots / Logs

(screenshot: Jan's "Context Length" slider capped at 2048)

Unfortunately, there are no logs under ~/jan/logs (I searched my entire home folder for an app.log file, but found none).

Maybe I'll check out the repo and debug this myself later on.

What is your OS?

  • MacOS
  • Windows
  • Linux (selected)
@alexbfr alexbfr added the type: bug Something isn't working label Sep 21, 2024
@imtuyethan imtuyethan self-assigned this Sep 23, 2024
@imtuyethan imtuyethan moved this to Need Investigation in Jan & Cortex Sep 23, 2024
@imtuyethan imtuyethan assigned louis-jan and unassigned imtuyethan Sep 23, 2024
imtuyethan (Contributor) commented Sep 23, 2024

Model card says it supports ~32,768 tokens:

(screenshot of the model card)

Seems like it's related to #2320
Or possibly related to #3558

@imtuyethan imtuyethan moved this from Need Investigation to Scheduled in Jan & Cortex Sep 23, 2024
@imtuyethan imtuyethan added this to the v0.5.5 milestone Sep 23, 2024
@imtuyethan imtuyethan removed the os: linux Linux issues label Sep 23, 2024
@github-project-automation github-project-automation bot moved this from In Review to Completed in Jan & Cortex Sep 24, 2024
imtuyethan (Contributor) commented Oct 1, 2024

LGTM on v0.5.4-650

(screenshot verifying the fix on v0.5.4-650)

@imtuyethan imtuyethan moved this from Review + QA to Completed in Jan & Cortex Oct 1, 2024
cnm13ryan pushed a commit to cnm13ryan/RepoAgent that referenced this issue Nov 12, 2024
From janhq/jan#3714 (comment), we know that the context length for the GGUF models is 32768. For the full context length of 131072, one has to refer to the non-GGUF models.