Always "failed to tokenize string!" #290

w1103693423 · 2023-03-19T11:29:50Z

failed to tokenize string!

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
failed to tokenize string!

main: prompt: ' china'
main: number of tokens in prompt = 1
1 -> ''

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

曲ー！ /S部ュース / KSHErsLAheLUE - THE NEW CH`,MEgeERSION IS HERE@ÿThis entry was вер in news on JuneSASSSASS8 by adminS [end of text]

sw · 2023-03-19T11:37:31Z

Can you provide the command line and a checksum of the model file?

Shimadaaaaa · 2023-03-20T08:59:13Z

Same problem with ggml-model-q4_0.bin; the md5sum is 919e4f8aee6ce4f3fbabb6cbcd7756db

w1103693423 · 2023-03-20T10:40:25Z

> Can you provide the command line and a checksum of the model file?

./main -m ./models/7B/ggml-model-q4_0.bin -p "china" -n 512

checksum:
md5sum ggml-model-q4_0.bin
919e4f8aee6ce4f3fbabb6cbcd7756db ggml-model-q4_0.bin
6efc8dab194ab59e49cd24be5574d85e consolidated.00.pth
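For anyone who wants to compare their own files against the checksums above without `md5sum`, here is a minimal sketch in Python. The helper name `md5_of_file` and the chunk size are assumptions for illustration; only `hashlib.md5` from the standard library is used.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks.

    Hypothetical helper: reading in chunks keeps memory use low for
    multi-gigabyte model files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("ggml-model-q4_0.bin")` should give the same hex string that `md5sum` prints for that file, which can then be compared with the values posted in this thread.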

sw · 2023-03-20T19:13:17Z

The files look good, though they are in the "old" format; you'll have to regenerate them if you update to the latest master.

There should be three tokens recognized with the old tokenizer:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
 18558 -> ' chi'
  1056 -> 'na'

The new tokenizer gives different tokens:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
   521 -> ' ch'
  1099 -> 'ina'

I really can't explain this, unless you have some strange terminal encoding set?
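To illustrate how two tokenizers can split the same string differently over the same vocabulary, here is a toy sketch in Python. It is not the llama.cpp implementation of either tokenizer: the token ids are the ones shown in the logs above, but the scores, the four-entry vocabulary, and both function names are assumptions made up for the example, and the leading token `1 -> ''` is left out.

```python
# Token ids come from the logs in this thread; the scores are hypothetical.
VOCAB = {
    " chi": (18558, -10.0),
    "na": (1056, -6.0),
    " ch": (521, -5.0),
    "ina": (1099, -7.0),
}

def greedy_longest_match(text):
    """At each position, take the longest vocabulary piece that matches."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece][0])
                pos = end
                break
        else:
            # No vocabulary piece starts at this position.
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids

def best_score_segmentation(text):
    """Pick the split with the highest total score via dynamic programming."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] = (total score, ids) for text[:i]
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is None or piece not in VOCAB:
                continue
            tok_id, score = VOCAB[piece]
            cand = (best[start][0] + score, best[start][1] + [tok_id])
            if best[end] is None or cand[0] > best[end][0]:
                best[end] = cand
    if best[n] is None:
        raise ValueError("cannot tokenize string")
    return best[n][1]

print(greedy_longest_match(" china"))     # [18558, 1056]
print(best_score_segmentation(" china"))  # [521, 1099]
```

With these made-up scores, the greedy strategy reproduces the ' chi' + 'na' split and the score-based strategy reproduces the ' ch' + 'ina' split. Either way, a working tokenizer should yield more than the single empty token reported in the original post.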

w1103693423 · 2023-03-22T06:48:33Z

The encoding is LANG=en_US.UTF-8
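As a quick sanity check on the encoding question, the following Python sketch prints the locale's preferred encoding and the UTF-8 bytes of the prompt. The helper name `describe_prompt` is an assumption for illustration; it only shows what a plain ASCII prompt looks like as bytes and does not reproduce how `main` reads its arguments.

```python
import locale

def describe_prompt(prompt):
    """Return the prompt's UTF-8 bytes and whether they are plain ASCII."""
    raw = prompt.encode("utf-8")
    return raw, raw.isascii()

# Encoding Python infers from the current locale settings (e.g. LANG).
print("preferred encoding:", locale.getpreferredencoding(False))
print(describe_prompt(" china"))  # (b' china', True)
```

A prompt such as ' china' is pure ASCII, so its bytes are the same under UTF-8 and most other common terminal encodings.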

w1103693423 · 2023-03-22T09:38:47Z

Thank you very much. It works now after I upgraded Python to version 3.9, pulled the latest master code, and redeployed it.

sw · 2023-04-07T16:17:08Z

Possibly a duplicate of #113.

sw added the "need more info" label (The OP should provide more details about the issue) on Mar 19, 2023

sw removed the "need more info" label on Mar 20, 2023

gjmulder added the "bug" label (Something isn't working) on Mar 20, 2023

sw closed this as completed on Apr 7, 2023
