
Add support for tokenize and untokenize of UTF-8 encoding in prompt/output #87

Closed · wants to merge 8 commits

Conversation

@wizd commented Mar 13, 2023

LLaMA's tokenization process is filled with magic numbers and is not easy to reproduce. However, I have found that using the SentencePiece library works well. It's possible that the original LLaMA model also used SentencePiece for its tokenization.

Test prompt: '我静静的坐在雨中,思考着' ("I sit quietly in the rain, thinking")

This sentence gets tokenized almost entirely into <0x??> byte-fallback tokens, which makes the original text very difficult to reconstruct.

[Screenshot: tokenizer output, 2023-03-13 at 5:14:58 PM]
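For reference, a minimal sketch of the SentencePiece round trip (the model path, error handling, and build command are illustrative assumptions, not the exact code in this PR):

```cpp
// Minimal SentencePiece round-trip sketch (illustrative only).
// Build, for example: g++ -std=c++17 sp_demo.cpp -o sp_demo -lsentencepiece
#include <sentencepiece_processor.h>

#include <iostream>
#include <string>
#include <vector>

int main() {
    sentencepiece::SentencePieceProcessor sp;

    // Assumed path: LLaMA ships tokenizer.model alongside the weights.
    const auto status = sp.Load("./models/tokenizer.model");
    if (!status.ok()) {
        std::cerr << "failed to load tokenizer: " << status.ToString() << "\n";
        return 1;
    }

    // UTF-8 text -> token ids.
    const std::string text = "我静静的坐在雨中,思考着";
    const std::vector<int> ids = sp.EncodeAsIds(text);
    for (const int id : ids) {
        std::cout << id << " -> '" << sp.IdToPiece(id) << "'\n";
    }

    // Token ids -> UTF-8 text; should round-trip to the original prompt,
    // reassembling any <0x??> byte-fallback pieces into valid UTF-8.
    std::cout << sp.DecodeIds(ids) << "\n";
    return 0;
}
```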

@baifachuan commented Mar 13, 2023

make fails to compile:

     |                               ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:683:40: error: ‘absl::string_view’ has not been declared
  683 |   util::Status ParseExtraOptions(absl::string_view extra_option,
      |                                        ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:13: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |             ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:38: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |                                      ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                         ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
/usr/local/include/sentencepiece_processor.h:692:54: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                      ^~~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                         ^~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 2 is invalid
/usr/local/include/sentencepiece_processor.h:721:35: error: ‘string_view’ is not a member of ‘absl’
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:721:59: error: expected primary-expression before ‘*’ token
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                           ^
/usr/local/include/sentencepiece_processor.h:721:60: error: ‘model_proto’ was not declared in this scope; did you mean ‘ModelProto’?
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                            ^~~~~~~~~~~
      |                                                            ModelProto
/usr/local/include/sentencepiece_processor.h:724:35: error: ‘string_view’ is not a member of ‘absl’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:724:48: error: expected primary-expression before ‘const’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                                ^~~~~
utils.cpp: In function ‘std::vector<int> llama_tokenize(const gpt_vocab&, const string&, bool)’:
utils.cpp:291:13: error: invalid conversion from ‘const char*’ to ‘int’ [-fpermissive]
  291 |     sp.Load("./models/tokenizer.model");
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |             |
      |             const char*
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:244:47: note:   initializing argument 1 of ‘virtual sentencepiece::util::Status sentencepiece::SentencePieceProcessor::Load(int)’
  244 |   virtual util::Status Load(absl::string_view filename);
      |                             ~~~~~~~~~~~~~~~~~~^~~~~~~~
utils.cpp:294:27: error: cannot convert ‘const string’ {aka ‘const std::__cxx11::basic_string<char>’} to ‘int’
  294 |     return sp.EncodeAsIds(text);
      |                           ^~~~
      |                           |
      |                           const string {aka const std::__cxx11::basic_string<char>}
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:457:58: note:   initializing argument 1 of ‘virtual std::vector<int> sentencepiece::SentencePieceProcessor::EncodeAsIds(int) const’
  457 |   virtual std::vector<int> EncodeAsIds(absl::string_view input) const {
      |                                        ~~~~~~~~~~~~~~~~~~^~~~~
make: *** [Makefile:185: utils.o] Error 1

@wizd (Author) commented Mar 13, 2023

You need to install the sentencepiece library manually. On macOS:
https://github.com/google/sentencepiece#build-and-install-using-vcpkg
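A likely cause of the absl::string_view errors above is that llama.cpp builds with -std=c++11 while the installed sentencepiece headers expect C++17 (they alias absl::string_view to std::string_view). A sketch of a possible workaround; the flags are illustrative, adjust for your platform:

```sh
# Illustrative workaround: rebuild with C++17 and link sentencepiece.
# GNU make lets command-line variables override the Makefile's defaults;
# on macOS, LDFLAGS would also need -framework Accelerate.
make clean
make CXXFLAGS="-I. -I./examples -O3 -DNDEBUG -std=c++17 -fPIC -pthread" \
     LDFLAGS="-L/usr/local/lib -lsentencepiece"
```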

@lucasjinreal commented Mar 13, 2023

@wizd I'm using your fork, and interactive mode does not work:

[Screenshot: interactive mode output]

It gets stuck in an infinite loop.

@ggerganov (Owner) commented Mar 13, 2023

Resolved in #79

@ggerganov closed this Mar 13, 2023
@ggerganov (Owner) commented

Oh wait, did I get confused?
Does #79 not resolve the tokenizer issues?

@kharvd (Contributor) commented Mar 13, 2023

I think it does. Are you still able to reproduce the issues?

@ggerganov (Owner) commented

I reran the convert script and I get the following:

make -j && ./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 64 -s 11 -p "我静静的坐在雨中,思考着"
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 11
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from 'models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: prompt: '我静静的坐在雨中,思考着'
main: number of tokens in prompt = 2
     1 -> ''
 30672 -> '我'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


我们已经开始了。 (We've already begun.)
������行������。 (A camel caravan travels in a circle. )
The above-mentioned idioms and phrases are what I found on Chinese websites when googling

main: mem per token = 22439492 bytes
main:     load time =  2962.67 ms
main:   sample time =    59.34 ms
main:  predict time =  5717.17 ms / 87.96 ms per token
main:    total time = 10370.07 ms

There are still 2 problems:

  • The prompt is not fully converted to tokens (only the leading '我' was tokenized)
  • The generated text contains invalid characters (see the sketch below)
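On the invalid characters: this is the classic symptom of a multi-byte UTF-8 character being split across several tokens while each token's bytes are printed immediately. One possible fix on the output side (a sketch, not this PR's code) is to buffer bytes until a complete UTF-8 sequence has arrived:

```cpp
#include <cstdio>
#include <string>

// Sketch: buffer raw token bytes and emit only complete UTF-8 sequences.
// (Illustrative only; function name and structure are assumptions.)
static std::string utf8_buf;

static void print_token_bytes(const std::string & token_bytes) {
    utf8_buf += token_bytes;

    size_t i = 0;
    while (i < utf8_buf.size()) {
        const unsigned char c = (unsigned char) utf8_buf[i];
        size_t len = 1;                        // ASCII or stray continuation byte
        if      ((c & 0xE0) == 0xC0) len = 2;  // 110xxxxx: 2-byte sequence
        else if ((c & 0xF0) == 0xE0) len = 3;  // 1110xxxx: 3-byte sequence
        else if ((c & 0xF8) == 0xF0) len = 4;  // 11110xxx: 4-byte sequence
        if (i + len > utf8_buf.size()) {
            break;  // incomplete sequence: keep it until the next token arrives
        }
        fwrite(utf8_buf.data() + i, 1, len, stdout);
        i += len;
    }
    utf8_buf.erase(0, i);
    fflush(stdout);
}
```

With <0x??> byte-fallback tokens, a single Chinese character can span three tokens, so printing per token necessarily produces mojibake.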

@j-f1 (Collaborator) commented Mar 13, 2023

Can you try running from a shell script encoded as UTF-8 and outputting to a text file? Your terminal might not be handling Unicode correctly.

You’ll also need to re-generate your models from scratch since this PR changes how the ggml files are created.
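For the first suggestion, something along these lines would take the terminal out of the loop (file names and flags are placeholders):

```sh
# Keep the prompt in a UTF-8 file and capture raw output, bypassing
# terminal rendering entirely.
printf '我静静的坐在雨中,思考着' > prompt.txt
./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 64 -p "$(cat prompt.txt)" > out.txt
hexdump -C out.txt | head   # inspect the raw bytes for valid UTF-8
```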

@kharvd (Contributor) commented Mar 13, 2023

@ggerganov just in case: did you re-run the quantization script as well?
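(For reference, regenerating everything from scratch at the time looked roughly like this; paths and arguments are illustrative, the README has the exact steps:)

```sh
# Re-create the f16 ggml file from the original PyTorch weights,
# then re-quantize it (the trailing 2 selects q4_0).
python3 convert-pth-to-ggml.py models/7B/ 1
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
```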

@ggerganov (Owner) commented

> @ggerganov just in case: did you re-run the quantization script as well?

Oops .. all good now 🦙

@wizzard0 (Contributor) commented Mar 13, 2023 via email

@lucasjinreal commented

Has this been merged into master? How can I test it? wizd's branch doesn't work well in interactive mode.

@zhoujian1028 commented

@ggerganov [Screenshot: the prompt is not converted to tokens] How did you solve it? Thanks!

Deadsg pushed a commit to Deadsg/llama.cpp referencing this pull request on Dec 19, 2023: "Fix TypeError in low_level chat"