vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer #9449

kylo5aby · 2024-09-12T09:40:20Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

the PR mainly does:

create a tokenizer object after initialize the llama_vocab
put the initialization of mutable data of each tokenizer object in llama_tokenize_internal for thread safe

ngxson

This technically works, but IMO we could improve it a bit:

Currently, llm_tokenizer_data to store working data and llm_tokenizer to store "fixed" shared data. However, because tokenize() is still inside llm_tokenizer, there is nothing prevent it from mutating data inside llm_tokenizer (which is not desirable).

A better solution would be:

Move tokenize() function to llm_tokenizer_data (maybe change the class name to llm_tokenizer_session to reflect that the object is short-loved)
tokenize() take const llm_tokenizer * as argument. The const is to make sure that the shared object is read-only.

kylo5aby · 2024-09-14T02:56:51Z

This technically works, but IMO we could improve it a bit:

Currently, llm_tokenizer_data to store working data and llm_tokenizer to store "fixed" shared data. However, because tokenize() is still inside llm_tokenizer, there is nothing prevent it from mutating data inside llm_tokenizer (which is not desirable).

A better solution would be:

Move tokenize() function to llm_tokenizer_data (maybe change the class name to llm_tokenizer_session to reflect that the object is short-loved)

tokenize() take const llm_tokenizer * as argument. The const is to make sure that the shared object is read-only.

@ngxson Thanks your feedback, what approach do you think is better:

declare tokenize in llm_tokenizer_session, and pass const llm_tokenizer * tokenizer, and call tokenizer's related method in the inner, that means the logic implementation(eg. tokenize, append) is still belong to every llm_tokenizer. For example:

struct llm_tokenizer_bpe_session : llm_tokenizer_session {

    void tokenize(const llm_tokenizer * tokenizer, const std::string & text, std::vector<llama_vocab::id> & output) {
        tokenizer->tokenize(text, *this, output);
    }

    std::vector<llm_symbol> symbols;
    std::vector<llm_symbol> symbols_final;
    llm_bigram_bpe::queue work_queue;
};

Keep the structures as in current PR, and declare all methods as const for every llm_tokenizer , so the shared data in every llm_tokenizer are not allowed to modify. That maybe bring smaller change.
move all the logic implementation from every llm_tokenizer to llm_tokenizer_session, the llm_tokenizer will only contains shared data, every time want to use tokenize operations, it must create a llm_tokenizer_session, which maybe bring a break change.

ngxson · 2024-09-14T08:15:53Z

The 3rd option should be the proper way. The idea is:

llm_tokenizer contains only fixed/shared data. It will be initialized with the model, stored in model.tokenizer
Functions using the tokenizer like llama_tokenize_impl, llama_token_to_piece_impl, llama_detokenize_impl will take const llm_tokenizer & as input. They no longer take const vocab as a param because vocab is already inside const llm_tokenizer

every time want to use tokenize operations, it must create a llm_tokenizer_session, which maybe bring a break change.

I don't see why it's a breaking change compared to what you're currently doing with llm_tokenizer_data. Every time the tokenizer is used, a new llm_tokenizer_session must be created.

The constructor of the session will look like: llm_tokenizer_session(const llm_tokenizer & tokenizer)

src/llama-vocab.h

ngxson · 2024-09-19T11:39:56Z

src/llama-vocab.cpp

+            tokenizer = new llm_tokenizer_rwkv(vocab);
+            break;
+        default:
+            GGML_ABORT("fatal error");


Suggested change

GGML_ABORT("fatal error");

GGML_ABORT("unknown vocab type");

src/llama-vocab.cpp

ggerganov

@kylo5aby Are you interested in adding a test that accepts a vocab file (see ./models/ggml-vocab*.gguf and tokenizes random strings in parallel on multiple threads? The test-log test can be used as a starting point. The goal is to run it through thread sanitizer in order to guarantee thread-safety of the tokenization API.

ggerganov · 2024-09-19T16:27:00Z

src/llama-vocab.cpp

-struct llm_tokenizer_wpm {
-    llm_tokenizer_wpm(const llama_vocab & vocab): vocab(vocab) {}
+struct llm_tokenizer_wpm : llm_tokenizer {
+    llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}


Suggested change

llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}

llm_tokenizer_wpm(const llama_vocab & vocab) : llm_tokenizer(vocab) {}

ggerganov · 2024-09-19T16:27:09Z

src/llama-vocab.cpp

-struct llm_tokenizer_bpe {
-    llm_tokenizer_bpe(const llama_vocab & vocab): vocab(vocab) {
+struct llm_tokenizer_bpe : llm_tokenizer {
+    llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {


Suggested change

llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {

llm_tokenizer_bpe(const llama_vocab & vocab) : llm_tokenizer(vocab) {

ggerganov · 2024-09-20T07:21:55Z

tests/CMakeLists.txt

+# build test-tokenizer-parallel target once and add many tests
+add_executable(test-tokenizer-parallel test-tokenizer-parallel.cpp)
+target_link_libraries(test-tokenizer-parallel PRIVATE common)
+install(TARGETS test-tokenizer-parallel RUNTIME)
+
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-bert-bge          ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-bert-bge.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-command-r         ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-command-r.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-coder    ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-coder.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-llm      ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-llm.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-falcon            ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-falcon.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-gpt-2             ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-gpt-2.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-bpe         ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-bpe.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-spm         ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-spm.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-mpt               ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-mpt.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-phi-3             ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-phi-3.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-qwen2             ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-qwen2.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-refact            ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-refact.gguf)
+llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-starcoder         ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-starcoder.gguf)
+


Let's improve this a bit further - looking at your changes I realized we don't need to create a separate test. We can simply extend the existing test-tokenizer-0 to become multi-threaded. You've pretty much done it in test-tokenizer-parallel.cpp, but just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end - this remains single-threaded.

Let's improve this a bit further - looking at your changes I realized we don't need to create a separate test. We can simply extend the existing test-tokenizer-0 to become multi-threaded. You've pretty much done it in test-tokenizer-parallel.cpp, but just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end - this remains single-threaded.

how about introduce a mutex to make the printing of results orderly?

how about introduce a mutex to make the printing of results orderly?

Nah, it would still be scrambled. Since there aren't that many tests anyway, maybe it's simpler to have all threads compute all tests in parallel and only the first thread to print the results.

Reasonable. I'll update test-tokenizer-0 and makes it only print the result of thread 0, if there are any data race, the thread sanitizer will post warnings or abort.

ggml-ci

…lama.cpp into ggerganov-gg/tokenizer-cleanup

ggml-ci

kylo5aby · 2024-09-23T07:47:00Z

I found there is a CI error on windows(clang), as mentioned in #9557, that due to symbol confict, after converting the types in examples, this would be solved.

ggerganov · 2024-09-23T12:16:18Z

To fix the CI and enable merge, rename llama_vocab to my_llama_vocab here:

llama.cpp/examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp

Lines 203 to 207 in 37f8c7b

    
           struct llama_vocab { 
        
               using id    = int32_t; 
        
               using token = std::string; 
        
               using ttype = llama_token_type;

src/llama.cpp

Nekotekina · 2024-09-24T13:17:38Z

Thank you very much, without this PR, tokenization in e5-like models takes insane amount of time. Now it's nearly instant.

* refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * refactor tokenizer * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * remove unused files * remove unused fileds to avoid unused filed build error * avoid symbol link error * Update src/llama.cpp * Update src/llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

kylo5aby changed the title ~~refactor tokenizer to reduce the overhead of creating multi times tokenizer~~ vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer Sep 12, 2024

kylo5aby force-pushed the refactor-tokenizer branch from 5447152 to 9ce57e0 Compare September 12, 2024 10:13

ngxson reviewed Sep 12, 2024

View reviewed changes

kylo5aby force-pushed the refactor-tokenizer branch from 9ce57e0 to 0166d83 Compare September 18, 2024 09:52

ngxson reviewed Sep 18, 2024

View reviewed changes

src/llama-vocab.h Outdated Show resolved Hide resolved

kylo5aby force-pushed the refactor-tokenizer branch from 0166d83 to 5abee9c Compare September 19, 2024 08:58

ngxson approved these changes Sep 19, 2024

View reviewed changes

ngxson requested a review from ggerganov September 19, 2024 11:49

ggerganov reviewed Sep 19, 2024

View reviewed changes

ggerganov mentioned this pull request Sep 19, 2024

llama : refactor llama_vocab #9369

Closed

refactor tokenizer

d949c58

kylo5aby force-pushed the refactor-tokenizer branch from 5abee9c to d949c58 Compare September 20, 2024 07:05

ggerganov reviewed Sep 20, 2024

View reviewed changes

github-actions bot added the testing Everything test related label Sep 20, 2024

ggerganov and others added 2 commits September 20, 2024 11:41

llama : make llm_tokenizer more private

6e873e5

ggml-ci

refactor tokenizer

403758f

kylo5aby force-pushed the refactor-tokenizer branch from d949c58 to 403758f Compare September 20, 2024 09:19

kylo5aby and others added 5 commits September 20, 2024 18:38

Merge branch 'gg/tokenizer-cleanup' of https://github.com/ggerganov/l…

d653d25

…lama.cpp into ggerganov-gg/tokenizer-cleanup

refactor tokenizer

2ec25db

llama : make llm_tokenizer more private

02629d9

ggml-ci

Merge branch 'ggerganov-gg/tokenizer-cleanup' into refactor-tokenizer

8be5d11

remove unused files

25d4599

ggerganov mentioned this pull request Sep 20, 2024

baby-llama : use unnamed namespace in baby_llama_layer #9557

Open

4 tasks

remove unused fileds to avoid unused filed build error

768c43f

avoid symbol link error

21ee380

github-actions bot added the examples label Sep 24, 2024

ggerganov approved these changes Sep 24, 2024

View reviewed changes

src/llama.cpp Outdated Show resolved Hide resolved

src/llama.cpp Outdated Show resolved Hide resolved

ggerganov added 2 commits September 25, 2024 16:19

Update src/llama.cpp

95e433c

Update src/llama.cpp

cd145b1

ggerganov merged commit 6102037 into ggerganov:master Sep 28, 2024
53 checks passed

ggerganov added a commit that referenced this pull request Sep 28, 2024

llama : add comment about thread-safety [no ci] (#9449)

7398427

matiaslin pushed a commit to matiaslin/llama.cpp that referenced this pull request Sep 28, 2024

llama : add comment about thread-safety [no ci] (ggerganov#9449)

5c6b792

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024

llama : add comment about thread-safety [no ci] (ggerganov#9449)

fcefb5b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer #9449

vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer #9449

kylo5aby commented Sep 12, 2024 •

edited by ggerganov

Loading

ngxson left a comment •

edited

Loading

kylo5aby commented Sep 14, 2024

ngxson commented Sep 14, 2024 •

edited

Loading

ngxson Sep 19, 2024

ggerganov left a comment •

edited

Loading

ggerganov Sep 19, 2024

ggerganov Sep 19, 2024

ggerganov Sep 20, 2024

kylo5aby Sep 20, 2024

ggerganov Sep 20, 2024

kylo5aby Sep 20, 2024

kylo5aby commented Sep 23, 2024

ggerganov commented Sep 23, 2024

Nekotekina commented Sep 24, 2024

	llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}
	llm_tokenizer_wpm(const llama_vocab & vocab) : llm_tokenizer(vocab) {}

	llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {
	llm_tokenizer_bpe(const llama_vocab & vocab) : llm_tokenizer(vocab) {

vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer #9449

vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer #9449

Conversation

kylo5aby commented Sep 12, 2024 • edited by ggerganov Loading

ngxson left a comment • edited Loading

Choose a reason for hiding this comment

kylo5aby commented Sep 14, 2024

ngxson commented Sep 14, 2024 • edited Loading

ngxson Sep 19, 2024

Choose a reason for hiding this comment

ggerganov left a comment • edited Loading

Choose a reason for hiding this comment

ggerganov Sep 19, 2024

Choose a reason for hiding this comment

ggerganov Sep 19, 2024

Choose a reason for hiding this comment

ggerganov Sep 20, 2024

Choose a reason for hiding this comment

kylo5aby Sep 20, 2024

Choose a reason for hiding this comment

ggerganov Sep 20, 2024

Choose a reason for hiding this comment

kylo5aby Sep 20, 2024

Choose a reason for hiding this comment

kylo5aby commented Sep 23, 2024

ggerganov commented Sep 23, 2024

Nekotekina commented Sep 24, 2024

kylo5aby commented Sep 12, 2024 •

edited by ggerganov

Loading

ngxson left a comment •

edited

Loading

ngxson commented Sep 14, 2024 •

edited

Loading

ggerganov left a comment •

edited

Loading