always "failed to tokenize string!" #290

Closed
w1103693423 opened this issue Mar 19, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@w1103693423

failed to tokenize string!

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
failed to tokenize string!

main: prompt: ' china'
main: number of tokens in prompt = 1
1 -> ''

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

曲ー! /S部ュース / KSHErsLAheLUE - THE NEW CH`,MEgeERSION IS HERE@ÿThis entry was вер in news on JuneSASSSASS8 by adminS [end of text]

@sw
Contributor

sw commented Mar 19, 2023

Can you provide the command line and a checksum of the model file?

sw added the "need more info" label (The OP should provide more details about the issue) on Mar 19, 2023
@Shimadaaaaa

Shimadaaaaa commented Mar 20, 2023

Same problem here with ggml-model-q4_0.bin; its md5sum is 919e4f8aee6ce4f3fbabb6cbcd7756db

@w1103693423
Author

Can you provide the command line and a checksum of the model file?

./main -m ./models/7B/ggml-model-q4_0.bin -p "china" -n 512

checksum:
md5sum ggml-model-q4_0.bin
919e4f8aee6ce4f3fbabb6cbcd7756db ggml-model-q4_0.bin
6efc8dab194ab59e49cd24be5574d85e consolidated.00.pth
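For anyone who wants to repeat the checksum comparison without `md5sum` (for example on Windows), here is a minimal Python sketch. The helper names `file_md5` and `matches_expected` are made up for this example and are not part of the project; the only library used is the standard `hashlib` module.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte model files do not have to fit in memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's MD5 with an expected hex string (case-insensitive)."""
    return file_md5(path) == expected_hex.strip().lower()


# Example call, using the path and checksum quoted in this thread:
# matches_expected("./models/7B/ggml-model-q4_0.bin",
#                  "919e4f8aee6ce4f3fbabb6cbcd7756db")
```

A matching checksum only shows that the file is byte-identical to someone else's copy; it says nothing about whether the file format matches the version of the code being run.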

@sw
Contributor

sw commented Mar 20, 2023

The files look good, though they are in the "old" format; you'll have to regenerate them if you update to the latest master.

There should be three tokens recognized with the old tokenizer:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
 18558 -> ' chi'
  1056 -> 'na'

The new tokenizer gives different tokens:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
   521 -> ' ch'
  1099 -> 'ina'

I really can't explain this, unless you have some strange terminal encoding set?
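To illustrate how the same prompt can be split into different pieces, here is a toy Python sketch. It uses only the four pieces and IDs printed in the two logs above and assumes, purely for illustration, a greedy longest-match strategy. It is not a description of what either llama.cpp tokenizer actually does, and `greedy_tokenize` and `TOY_VOCAB` are hypothetical names.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocab entry that
    matches at the current position (a toy strategy, not llama.cpp's)."""
    pieces = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocab:
                pieces.append(text[pos:end])
                pos = end
                break
        else:
            # No vocab entry starts here, so the toy tokenizer gives up.
            raise ValueError(f"no vocab entry matches at position {pos}")
    return pieces


# Pieces and IDs copied from the two logs in this thread.
TOY_VOCAB = {" chi": 18558, "na": 1056, " ch": 521, "ina": 1099}

old_split = [" chi", "na"]
new_split = [" ch", "ina"]

# Both segmentations cover exactly the same prompt text.
assert "".join(old_split) == "".join(new_split) == " china"

print(greedy_tokenize(" china", TOY_VOCAB))  # [' chi', 'na']
```

Both splits are valid covers of `' china'`, so which one comes out depends on the tokenizer's selection rule. A prompt that yields only the initial token, as in the original report, means no piece matched at all, which neither split explains.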

sw removed the "need more info" label (The OP should provide more details about the issue) on Mar 20, 2023
gjmulder added the "bug" label (Something isn't working) on Mar 20, 2023
@w1103693423
Author

encoding is LANG=en_US.UTF-8
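As a quick sanity check on the terminal-encoding theory, the following Python sketch tests whether a prompt produces the same bytes under several common encodings. The function name `encodings_agree` and the list of encodings are assumptions chosen for this example. A pure-ASCII prompt such as `china` encodes identically in all of them, so for this particular prompt the locale setting should not change the bytes passed to the program.

```python
import locale


def encodings_agree(prompt, encodings=("utf-8", "latin-1", "cp1252")):
    """Return True if the prompt encodes to identical bytes in every
    listed encoding, False if any encoding differs or cannot encode it."""
    try:
        encoded = {prompt.encode(name) for name in encodings}
    except UnicodeEncodeError:
        return False
    return len(encoded) == 1


# Encoding Python would use by default in this environment.
print(locale.getpreferredencoding(False))

print(encodings_agree("china"))  # True: ASCII-only text
print(encodings_agree("é"))      # False: UTF-8 and Latin-1 bytes differ
```

This only rules out encoding differences for ASCII prompts; it does not check how the shell or the program itself handles the argument.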

@w1103693423
Author

The files look good, though they are in the "old" format; you'll have to regenerate them if you update to the latest master.

There should be three tokens recognized with the old tokenizer:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
 18558 -> ' chi'
  1056 -> 'na'

The new tokenizer gives different tokens:

main: prompt: ' china'
main: number of tokens in prompt = 3
     1 -> ''
   521 -> ' ch'
  1099 -> 'ina'

I really can't explain this, unless you have some strange terminal encoding set?

Thank you very much. It works after I upgraded Python to 3.9, pulled the latest master code, and redeployed it.

sw closed this as completed on Apr 7, 2023
@sw
Contributor

sw commented Apr 7, 2023

Possibly a duplicate of #113.
