
Add support for tokenize and untokenize of UTF-8 encoding in prompt/output #87

Closed · wants to merge 8 commits

Conversation

@wizd commented Mar 13, 2023

LLaMA's tokenization process is filled with magic numbers and is not easy to reproduce. However, I have found that using the SentencePiece library works well. It's possible that the original LLaMA model also used SentencePiece for its tokenization.

Test prompt: '我静静的坐在雨中,思考着' ("I sit quietly in the rain, thinking")

This sentence gets tokenized almost entirely into <0x??> byte-fallback tokens, which makes the original text very difficult to reconstruct.

[Screenshot: tokenizer output, 2023-03-13 at 5:14:58 PM]
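For reference, a minimal sketch of the SentencePiece round trip (the model path, error handling, and build command are illustrative assumptions, not the exact code in this PR):

```cpp
// Minimal SentencePiece round-trip sketch (illustrative only).
// Build, for example: g++ -std=c++17 sp_demo.cpp -o sp_demo -lsentencepiece
#include <sentencepiece_processor.h>

#include <iostream>
#include <string>
#include <vector>

int main() {
    sentencepiece::SentencePieceProcessor sp;

    // Assumed path: LLaMA ships tokenizer.model alongside the weights.
    const auto status = sp.Load("./models/tokenizer.model");
    if (!status.ok()) {
        std::cerr << "failed to load tokenizer: " << status.ToString() << "\n";
        return 1;
    }

    // UTF-8 text -> token ids.
    const std::string text = "我静静的坐在雨中,思考着";
    const std::vector<int> ids = sp.EncodeAsIds(text);
    for (const int id : ids) {
        std::cout << id << " -> '" << sp.IdToPiece(id) << "'\n";
    }

    // Token ids -> UTF-8 text; should round-trip to the original prompt,
    // reassembling any <0x??> byte-fallback pieces into valid UTF-8.
    std::cout << sp.DecodeIds(ids) << "\n";
    return 0;
}
```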

@baifachuan commented Mar 13, 2023

make fails to compile:

     |                               ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:683:40: error: ‘absl::string_view’ has not been declared
  683 |   util::Status ParseExtraOptions(absl::string_view extra_option,
      |                                        ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:13: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |             ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:38: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |                                      ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                         ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
/usr/local/include/sentencepiece_processor.h:692:54: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                      ^~~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                         ^~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 2 is invalid
/usr/local/include/sentencepiece_processor.h:721:35: error: ‘string_view’ is not a member of ‘absl’
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:721:59: error: expected primary-expression before ‘*’ token
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                           ^
/usr/local/include/sentencepiece_processor.h:721:60: error: ‘model_proto’ was not declared in this scope; did you mean ‘ModelProto’?
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                            ^~~~~~~~~~~
      |                                                            ModelProto
/usr/local/include/sentencepiece_processor.h:724:35: error: ‘string_view’ is not a member of ‘absl’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:724:48: error: expected primary-expression before ‘const’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                                ^~~~~
utils.cpp: In function ‘std::vector<int> llama_tokenize(const gpt_vocab&, const string&, bool)’:
utils.cpp:291:13: error: invalid conversion from ‘const char*’ to ‘int’ [-fpermissive]
  291 |     sp.Load("./models/tokenizer.model");
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |             |
      |             const char*
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:244:47: note:   initializing argument 1 of ‘virtual sentencepiece::util::Status sentencepiece::SentencePieceProcessor::Load(int)’
  244 |   virtual util::Status Load(absl::string_view filename);
      |                             ~~~~~~~~~~~~~~~~~~^~~~~~~~
utils.cpp:294:27: error: cannot convert ‘const string’ {aka ‘const std::__cxx11::basic_string<char>’} to ‘int’
  294 |     return sp.EncodeAsIds(text);
      |                           ^~~~
      |                           |
      |                           const string {aka const std::__cxx11::basic_string<char>}
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:457:58: note:   initializing argument 1 of ‘virtual std::vector<int> sentencepiece::SentencePieceProcessor::EncodeAsIds(int) const’
  457 |   virtual std::vector<int> EncodeAsIds(absl::string_view input) const {
      |                                        ~~~~~~~~~~~~~~~~~~^~~~~
make: *** [Makefile:185: utils.o] Error 1

@wizd (Author) commented Mar 13, 2023

You need to install the sentencepiece library manually. On macOS:
https://github.com/google/sentencepiece#build-and-install-using-vcpkg
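A likely cause of the absl::string_view errors above is that llama.cpp builds with -std=c++11 while the installed sentencepiece headers expect C++17 (they alias absl::string_view to std::string_view). A sketch of a possible workaround; the flags are illustrative, adjust for your platform:

```sh
# Illustrative workaround: rebuild with C++17 and link sentencepiece.
# GNU make lets command-line variables override the Makefile's defaults;
# on macOS, LDFLAGS would also need -framework Accelerate.
make clean
make CXXFLAGS="-I. -I./examples -O3 -DNDEBUG -std=c++17 -fPIC -pthread" \
     LDFLAGS="-L/usr/local/lib -lsentencepiece"
```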

@lucasjinreal commented Mar 13, 2023

@wizd I'm using your fork, and interactive mode does not work:

[Screenshot: interactive mode output]

It gets stuck in an infinite loop.

@ggerganov (Owner) commented Mar 13, 2023

Resolved in #79

@ggerganov closed this Mar 13, 2023
@ggerganov (Owner) commented

Oh wait, did I get confused?
Does #79 not resolve the tokenizer issues?

@kharvd (Contributor) commented Mar 13, 2023

I think it does. Are you still able to reproduce the issues?

@ggerganov (Owner) commented

I reran the convert script and I get the following:

make -j && ./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 64 -s 11 -p "我静静的坐在雨中,思考着"
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 11
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from 'models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: prompt: '我静静的坐在雨中,思考着'
main: number of tokens in prompt = 2
     1 -> ''
 30672 -> '我'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


我们已经开始了。 (We've already begun.)
������行������。 (A camel caravan travels in a circle. )
The above-mentioned idioms and phrases are what I found on Chinese websites when googling

main: mem per token = 22439492 bytes
main:     load time =  2962.67 ms
main:   sample time =    59.34 ms
main:  predict time =  5717.17 ms / 87.96 ms per token
main:    total time = 10370.07 ms

There are still 2 problems:

  • The prompt is not fully converted to tokens (only the leading '我' was tokenized)
  • The generated text contains invalid characters (see the sketch below)
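On the invalid characters: this is the classic symptom of a multi-byte UTF-8 character being split across several tokens while each token's bytes are printed immediately. One possible fix on the output side (a sketch, not this PR's code) is to buffer bytes until a complete UTF-8 sequence has arrived:

```cpp
#include <cstdio>
#include <string>

// Sketch: buffer raw token bytes and emit only complete UTF-8 sequences.
// (Illustrative only; function name and structure are assumptions.)
static std::string utf8_buf;

static void print_token_bytes(const std::string & token_bytes) {
    utf8_buf += token_bytes;

    size_t i = 0;
    while (i < utf8_buf.size()) {
        const unsigned char c = (unsigned char) utf8_buf[i];
        size_t len = 1;                        // ASCII or stray continuation byte
        if      ((c & 0xE0) == 0xC0) len = 2;  // 110xxxxx: 2-byte sequence
        else if ((c & 0xF0) == 0xE0) len = 3;  // 1110xxxx: 3-byte sequence
        else if ((c & 0xF8) == 0xF0) len = 4;  // 11110xxx: 4-byte sequence
        if (i + len > utf8_buf.size()) {
            break;  // incomplete sequence: keep it until the next token arrives
        }
        fwrite(utf8_buf.data() + i, 1, len, stdout);
        i += len;
    }
    utf8_buf.erase(0, i);
    fflush(stdout);
}
```

With <0x??> byte-fallback tokens, a single Chinese character can span three tokens, so printing per token necessarily produces mojibake.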

@j-f1 (Collaborator) commented Mar 13, 2023

Can you try running from a shell script encoded as UTF-8 and outputting to a text file? Your terminal might not be handling Unicode correctly.

You’ll also need to re-generate your models from scratch since this PR changes how the ggml files are created.
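For the first suggestion, something along these lines would take the terminal out of the loop (file names and flags are placeholders):

```sh
# Keep the prompt in a UTF-8 file and capture raw output, bypassing
# terminal rendering entirely.
printf '我静静的坐在雨中,思考着' > prompt.txt
./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 64 -p "$(cat prompt.txt)" > out.txt
hexdump -C out.txt | head   # inspect the raw bytes for valid UTF-8
```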

@kharvd (Contributor) commented Mar 13, 2023

@ggerganov just in case: did you re-run the quantization script as well?
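(For reference, regenerating everything from scratch at the time looked roughly like this; paths and arguments are illustrative, the README has the exact steps:)

```sh
# Re-create the f16 ggml file from the original PyTorch weights,
# then re-quantize it (the trailing 2 selects q4_0).
python3 convert-pth-to-ggml.py models/7B/ 1
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
```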

@ggerganov (Owner) commented

> @ggerganov just in case: did you re-run the quantization script as well?

Oops .. all good now 🦙

@wizzard0 (Contributor) commented Mar 13, 2023 via email

@lucasjinreal commented

Has this been merged into master? How can I test it? wizd's branch doesn't work well in interactive mode.

@zhoujian1028 commented

@ggerganov [Screenshot: the prompt is not converted to tokens] How did you solve it? Thanks!

Deadsg pushed a commit to Deadsg/llama.cpp referencing this pull request on Dec 19, 2023: "Fix TypeError in low_level chat"